The course project is based on the Home Credit Default Risk (HCDR) Kaggle Competition. The goal of this project is to predict whether or not a client will repay a loan. In order to make sure that people who struggle to get loans due to insufficient or non-existent credit histories have a positive loan experience, Home Credit makes use of a variety of alternative data--including telco and transactional information--to predict their clients' repayment abilities.
Kaggle is a data science competition platform that also hosts many datasets. In the past, submitting a result was troublesome because you had to go through the console in your browser and drag your files there. Now you can interact with Kaggle via the command line, e.g.:
! kaggle competitions files home-credit-default-risk
It is quite easy to set up; it took me less than 15 minutes to finish a submission.
- kaggle.json file
- kaggle.json in the right place

For more detailed information on setting up the Kaggle API, see here and here.
!pip install kaggle
Requirement already satisfied: kaggle in /usr/local/lib/python3.7/site-packages (1.5.12)
Requirement already satisfied: tqdm in /usr/local/lib/python3.7/site-packages (from kaggle) (4.62.1)
Requirement already satisfied: six>=1.10 in /usr/local/lib/python3.7/site-packages (from kaggle) (1.15.0)
Requirement already satisfied: requests in /usr/local/lib/python3.7/site-packages (from kaggle) (2.25.1)
Requirement already satisfied: certifi in /usr/local/lib/python3.7/site-packages (from kaggle) (2021.5.30)
Requirement already satisfied: python-dateutil in /usr/local/lib/python3.7/site-packages (from kaggle) (2.8.2)
Requirement already satisfied: urllib3 in /usr/local/lib/python3.7/site-packages (from kaggle) (1.26.6)
Requirement already satisfied: python-slugify in /usr/local/lib/python3.7/site-packages (from kaggle) (5.0.2)
Requirement already satisfied: text-unidecode>=1.3 in /usr/local/lib/python3.7/site-packages (from python-slugify->kaggle) (1.3)
Requirement already satisfied: idna<3,>=2.5 in /usr/local/lib/python3.7/site-packages (from requests->kaggle) (2.10)
Requirement already satisfied: chardet<5,>=3.0.2 in /usr/local/lib/python3.7/site-packages (from requests->kaggle) (4.0.0)
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
WARNING: You are using pip version 21.2.4; however, version 21.3.1 is available.
You should consider upgrading via the '/usr/local/bin/python -m pip install --upgrade pip' command.
!pwd
/root/shared/Documents/Indiana Coursework/INFO-I 526 Applied ML/AML-project
!mkdir ~/.kaggle
!cp /root/shared/Downloads/kaggle.json ~/.kaggle
!chmod 600 ~/.kaggle/kaggle.json
mkdir: cannot create directory ‘/root/.kaggle’: File exists
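The `mkdir` error above is harmless: the directory already existed. A minimal Python sketch of the same three steps that can be re-run without errors might look like this. The helper name `install_kaggle_token` is hypothetical (it is not part of the Kaggle package), and the source path is whatever location your downloaded `kaggle.json` has.

```python
import os
import shutil
from pathlib import Path


def install_kaggle_token(src, dest_dir=None):
    """Copy a kaggle.json token into dest_dir and restrict its permissions.

    Hypothetical helper (not part of the Kaggle package); dest_dir defaults
    to ~/.kaggle, the location used by the shell commands above.
    """
    dest_dir = Path(dest_dir) if dest_dir is not None else Path.home() / ".kaggle"
    dest_dir.mkdir(parents=True, exist_ok=True)  # no error if it already exists
    dest = dest_dir / "kaggle.json"
    shutil.copyfile(src, dest)
    os.chmod(dest, 0o600)  # owner read/write only, same as chmod 600
    return dest
```

For the paths used in this notebook, the call would be `install_kaggle_token("/root/shared/Downloads/kaggle.json")`.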
! kaggle competitions files home-credit-default-risk
name                                size   creationDate
----------------------------------  -----  -------------------
sample_submission.csv               524KB  2019-12-11 02:55:35
application_train.csv               158MB  2019-12-11 02:55:35
bureau.csv                          162MB  2019-12-11 02:55:35
credit_card_balance.csv             405MB  2019-12-11 02:55:35
application_test.csv                25MB   2019-12-11 02:55:35
installments_payments.csv           690MB  2019-12-11 02:55:35
POS_CASH_balance.csv                375MB  2019-12-11 02:55:35
HomeCredit_columns_description.csv  37KB   2019-12-11 02:55:35
bureau_balance.csv                  358MB  2019-12-11 02:55:35
previous_application.csv            386MB  2019-12-11 02:55:35
Many people struggle to get loans due to insufficient or non-existent credit histories. And, unfortunately, this population is often taken advantage of by untrustworthy lenders.
Home Credit strives to broaden financial inclusion for the unbanked population by providing a positive and safe borrowing experience. In order to make sure this underserved population has a positive loan experience, Home Credit makes use of a variety of alternative data--including telco and transactional information--to predict their clients' repayment abilities.
While Home Credit is currently using various statistical and machine learning methods to make these predictions, they're challenging Kagglers to help them unlock the full potential of their data. Doing so will ensure that clients capable of repayment are not rejected and that loans are given with a principal, maturity, and repayment calendar that will empower their clients to be successful.
Home Credit is a non-banking financial institution, founded in 1997 in the Czech Republic.
The company operates in 14 countries (including the United States, Russia, Kazakhstan, Belarus, China, and India) and focuses on lending primarily to people with little or no credit history, who would otherwise either not obtain loans or become victims of untrustworthy lenders.
Home Credit group has over 29 million customers, total assets of 21 billion euros, and over 160 million loans, with the majority in Asia and almost half of them in China (as of 19-05-2018).
There are 7 different sources of data:
Create a base directory:
DATA_DIR = "../../../Data/home-credit-default-risk" #same level as course repo in the data directory
Please download the project data files and data dictionary and unzip them using either of the following approaches:
Download button on the following Data Webpage and unzip the zip file to the base directory (DATA_DIR)
# DATA_DIR = "../../../Data/home-credit-default-risk" #same level as course repo in the data directory
DATA_DIR = "../Data/home-credit-default-risk" #same level as course repo in the data directory
print(DATA_DIR)
#DATA_DIR = os.path.join('./ddddd/')
!mkdir $DATA_DIR
../Data/home-credit-default-risk
mkdir: cannot create directory ‘../Data/home-credit-default-risk’: File exists
!ls -l $DATA_DIR
total 3326068
-rw-rw-r-- 1 root root     37383 Dec 11  2019 HomeCredit_columns_description.csv
-rw-rw-r-- 1 root root 392703158 Dec 11  2019 POS_CASH_balance.csv
-rw-rw-r-- 1 root root  26567651 Dec 11  2019 application_test.csv
-rw-rw-r-- 1 root root 166133370 Dec 11  2019 application_train.csv
-rw-rw-r-- 1 root root 170016717 Dec 11  2019 bureau.csv
-rw-rw-r-- 1 root root 375592889 Dec 11  2019 bureau_balance.csv
-rw-rw-r-- 1 root root 424582605 Dec 11  2019 credit_card_balance.csv
-rw-r--r-- 1 root root 721616255 Nov 10 07:15 home-credit-default-risk.zip
-rw-rw-r-- 1 root root 723118349 Dec 11  2019 installments_payments.csv
-rw-rw-r-- 1 root root 404973293 Dec 11  2019 previous_application.csv
-rw-rw-r-- 1 root root    536202 Dec 11  2019 sample_submission.csv
! kaggle competitions download home-credit-default-risk -p $DATA_DIR
home-credit-default-risk.zip: Skipping, found more recently modified local copy (use --force to force download)
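Once the archive has been downloaded, it still has to be unpacked into the data directory. A minimal sketch, assuming the archive is named `home-credit-default-risk.zip` as in the listing above. The helper `extract_missing` is a hypothetical name; it skips members that are already present, so re-running the cell is cheap.

```python
import zipfile
from pathlib import Path


def extract_missing(zip_path, out_dir):
    """Extract only the archive members that are not already in out_dir.

    Hypothetical helper; returns the list of member names it extracted.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    extracted = []
    with zipfile.ZipFile(zip_path) as archive:
        for name in archive.namelist():
            if not (out_dir / name).exists():
                archive.extract(name, out_dir)
                extracted.append(name)
    return extracted
```

With the variables used in this notebook, the call would be `extract_missing(os.path.join(DATA_DIR, "home-credit-default-risk.zip"), DATA_DIR)`.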
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder
import os
import zipfile
from sklearn.base import BaseEstimator, TransformerMixin
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler
from sklearn.pipeline import Pipeline, FeatureUnion
from pandas.plotting import scatter_matrix
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
import warnings
warnings.filterwarnings('ignore')
unzippingReq = False
if unzippingReq:  # please modify this code
    # Each archive is expected in the current directory as <name>.csv.zip
    archive_names = [
        'application_train', 'application_test', 'bureau_balance', 'bureau',
        'credit_card_balance', 'installments_payments', 'POS_CASH_balance',
        'previous_application',
    ]
    for archive_name in archive_names:
        # The context manager closes the archive even if extraction fails
        with zipfile.ZipFile(f'{archive_name}.csv.zip', 'r') as zip_ref:
            zip_ref.extractall('datasets')
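def load_data(in_path, name):
    """Read a CSV file, print its shape and schema, preview it, and return the DataFrame."""
    df = pd.read_csv(in_path)
    print(f"{name}: shape is {df.shape}")
    print(df.info())  # info() prints the schema itself and returns None, hence the trailing "None" in the output
    display(df.head(5))  # display() is available in Jupyter/IPython
    return df

datasets = {}  # let's store the datasets in a dictionary so we can keep track of them easily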
ds_name = 'application_train'
datasets[ds_name] = load_data(os.path.join(DATA_DIR, f'{ds_name}.csv'), ds_name)
datasets['application_train'].shape
application_train: shape is (307511, 122)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 307511 entries, 0 to 307510
Columns: 122 entries, SK_ID_CURR to AMT_REQ_CREDIT_BUREAU_YEAR
dtypes: float64(65), int64(41), object(16)
memory usage: 286.2+ MB
None

| | SK_ID_CURR | TARGET | NAME_CONTRACT_TYPE | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | ... | FLAG_DOCUMENT_18 | FLAG_DOCUMENT_19 | FLAG_DOCUMENT_20 | FLAG_DOCUMENT_21 | AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100002 | 1 | Cash loans | M | N | Y | 0 | 202500.0 | 406597.5 | 24700.5 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 1 | 100003 | 0 | Cash loans | F | N | N | 0 | 270000.0 | 1293502.5 | 35698.5 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 100004 | 0 | Revolving loans | M | Y | Y | 0 | 67500.0 | 135000.0 | 6750.0 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 100006 | 0 | Cash loans | F | N | Y | 0 | 135000.0 | 312682.5 | 29686.5 | ... | 0 | 0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | 100007 | 0 | Cash loans | M | N | Y | 0 | 121500.0 | 513000.0 | 21865.5 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
5 rows × 122 columns
(307511, 122)
ds_name = 'application_test'
datasets[ds_name] = load_data(os.path.join(DATA_DIR, f'{ds_name}.csv'), ds_name)
application_test: shape is (48744, 121)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48744 entries, 0 to 48743
Columns: 121 entries, SK_ID_CURR to AMT_REQ_CREDIT_BUREAU_YEAR
dtypes: float64(65), int64(40), object(16)
memory usage: 45.0+ MB
None

| | SK_ID_CURR | NAME_CONTRACT_TYPE | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | AMT_GOODS_PRICE | ... | FLAG_DOCUMENT_18 | FLAG_DOCUMENT_19 | FLAG_DOCUMENT_20 | FLAG_DOCUMENT_21 | AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100001 | Cash loans | F | N | Y | 0 | 135000.0 | 568800.0 | 20560.5 | 450000.0 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 100005 | Cash loans | M | N | Y | 0 | 99000.0 | 222768.0 | 17370.0 | 180000.0 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3.0 |
| 2 | 100013 | Cash loans | M | Y | Y | 0 | 202500.0 | 663264.0 | 69777.0 | 630000.0 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 4.0 |
| 3 | 100028 | Cash loans | F | N | Y | 2 | 315000.0 | 1575000.0 | 49018.5 | 1575000.0 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3.0 |
| 4 | 100038 | Cash loans | M | Y | N | 1 | 180000.0 | 625500.0 | 32067.0 | 625500.0 | ... | 0 | 0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN |
5 rows × 121 columns
The application dataset has the most information about the client: Gender, income, family status, education ...
%%time
ds_names = ("application_train", "application_test", "bureau", "bureau_balance", "credit_card_balance",
            "installments_payments", "previous_application", "POS_CASH_balance")
for ds_name in ds_names:
    datasets[ds_name] = load_data(os.path.join(DATA_DIR, f'{ds_name}.csv'), ds_name)
application_train: shape is (307511, 122)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 307511 entries, 0 to 307510
Columns: 122 entries, SK_ID_CURR to AMT_REQ_CREDIT_BUREAU_YEAR
dtypes: float64(65), int64(41), object(16)
memory usage: 286.2+ MB
None

| | SK_ID_CURR | TARGET | NAME_CONTRACT_TYPE | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | ... | FLAG_DOCUMENT_18 | FLAG_DOCUMENT_19 | FLAG_DOCUMENT_20 | FLAG_DOCUMENT_21 | AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100002 | 1 | Cash loans | M | N | Y | 0 | 202500.0 | 406597.5 | 24700.5 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 1 | 100003 | 0 | Cash loans | F | N | N | 0 | 270000.0 | 1293502.5 | 35698.5 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 100004 | 0 | Revolving loans | M | Y | Y | 0 | 67500.0 | 135000.0 | 6750.0 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 100006 | 0 | Cash loans | F | N | Y | 0 | 135000.0 | 312682.5 | 29686.5 | ... | 0 | 0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | 100007 | 0 | Cash loans | M | N | Y | 0 | 121500.0 | 513000.0 | 21865.5 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
5 rows × 122 columns
application_test: shape is (48744, 121)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48744 entries, 0 to 48743
Columns: 121 entries, SK_ID_CURR to AMT_REQ_CREDIT_BUREAU_YEAR
dtypes: float64(65), int64(40), object(16)
memory usage: 45.0+ MB
None

| | SK_ID_CURR | NAME_CONTRACT_TYPE | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | AMT_GOODS_PRICE | ... | FLAG_DOCUMENT_18 | FLAG_DOCUMENT_19 | FLAG_DOCUMENT_20 | FLAG_DOCUMENT_21 | AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100001 | Cash loans | F | N | Y | 0 | 135000.0 | 568800.0 | 20560.5 | 450000.0 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 100005 | Cash loans | M | N | Y | 0 | 99000.0 | 222768.0 | 17370.0 | 180000.0 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3.0 |
| 2 | 100013 | Cash loans | M | Y | Y | 0 | 202500.0 | 663264.0 | 69777.0 | 630000.0 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 4.0 |
| 3 | 100028 | Cash loans | F | N | Y | 2 | 315000.0 | 1575000.0 | 49018.5 | 1575000.0 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3.0 |
| 4 | 100038 | Cash loans | M | Y | N | 1 | 180000.0 | 625500.0 | 32067.0 | 625500.0 | ... | 0 | 0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN |
5 rows × 121 columns
bureau: shape is (1716428, 17)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1716428 entries, 0 to 1716427
Data columns (total 17 columns):
 #   Column                  Dtype
---  ------                  -----
 0   SK_ID_CURR              int64
 1   SK_ID_BUREAU            int64
 2   CREDIT_ACTIVE           object
 3   CREDIT_CURRENCY         object
 4   DAYS_CREDIT             int64
 5   CREDIT_DAY_OVERDUE      int64
 6   DAYS_CREDIT_ENDDATE     float64
 7   DAYS_ENDDATE_FACT       float64
 8   AMT_CREDIT_MAX_OVERDUE  float64
 9   CNT_CREDIT_PROLONG      int64
 10  AMT_CREDIT_SUM          float64
 11  AMT_CREDIT_SUM_DEBT     float64
 12  AMT_CREDIT_SUM_LIMIT    float64
 13  AMT_CREDIT_SUM_OVERDUE  float64
 14  CREDIT_TYPE             object
 15  DAYS_CREDIT_UPDATE      int64
 16  AMT_ANNUITY             float64
dtypes: float64(8), int64(6), object(3)
memory usage: 222.6+ MB
None

| | SK_ID_CURR | SK_ID_BUREAU | CREDIT_ACTIVE | CREDIT_CURRENCY | DAYS_CREDIT | CREDIT_DAY_OVERDUE | DAYS_CREDIT_ENDDATE | DAYS_ENDDATE_FACT | AMT_CREDIT_MAX_OVERDUE | CNT_CREDIT_PROLONG | AMT_CREDIT_SUM | AMT_CREDIT_SUM_DEBT | AMT_CREDIT_SUM_LIMIT | AMT_CREDIT_SUM_OVERDUE | CREDIT_TYPE | DAYS_CREDIT_UPDATE | AMT_ANNUITY |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 215354 | 5714462 | Closed | currency 1 | -497 | 0 | -153.0 | -153.0 | NaN | 0 | 91323.0 | 0.0 | NaN | 0.0 | Consumer credit | -131 | NaN |
| 1 | 215354 | 5714463 | Active | currency 1 | -208 | 0 | 1075.0 | NaN | NaN | 0 | 225000.0 | 171342.0 | NaN | 0.0 | Credit card | -20 | NaN |
| 2 | 215354 | 5714464 | Active | currency 1 | -203 | 0 | 528.0 | NaN | NaN | 0 | 464323.5 | NaN | NaN | 0.0 | Consumer credit | -16 | NaN |
| 3 | 215354 | 5714465 | Active | currency 1 | -203 | 0 | NaN | NaN | NaN | 0 | 90000.0 | NaN | NaN | 0.0 | Credit card | -16 | NaN |
| 4 | 215354 | 5714466 | Active | currency 1 | -629 | 0 | 1197.0 | NaN | 77674.5 | 0 | 2700000.0 | NaN | NaN | 0.0 | Consumer credit | -21 | NaN |
bureau_balance: shape is (27299925, 3)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 27299925 entries, 0 to 27299924
Data columns (total 3 columns):
 #   Column          Dtype
---  ------          -----
 0   SK_ID_BUREAU    int64
 1   MONTHS_BALANCE  int64
 2   STATUS          object
dtypes: int64(2), object(1)
memory usage: 624.8+ MB
None

| | SK_ID_BUREAU | MONTHS_BALANCE | STATUS |
|---|---|---|---|
| 0 | 5715448 | 0 | C |
| 1 | 5715448 | -1 | C |
| 2 | 5715448 | -2 | C |
| 3 | 5715448 | -3 | C |
| 4 | 5715448 | -4 | C |
credit_card_balance: shape is (3840312, 23)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3840312 entries, 0 to 3840311
Data columns (total 23 columns):
 #   Column                      Dtype
---  ------                      -----
 0   SK_ID_PREV                  int64
 1   SK_ID_CURR                  int64
 2   MONTHS_BALANCE              int64
 3   AMT_BALANCE                 float64
 4   AMT_CREDIT_LIMIT_ACTUAL     int64
 5   AMT_DRAWINGS_ATM_CURRENT    float64
 6   AMT_DRAWINGS_CURRENT        float64
 7   AMT_DRAWINGS_OTHER_CURRENT  float64
 8   AMT_DRAWINGS_POS_CURRENT    float64
 9   AMT_INST_MIN_REGULARITY     float64
 10  AMT_PAYMENT_CURRENT         float64
 11  AMT_PAYMENT_TOTAL_CURRENT   float64
 12  AMT_RECEIVABLE_PRINCIPAL    float64
 13  AMT_RECIVABLE               float64
 14  AMT_TOTAL_RECEIVABLE        float64
 15  CNT_DRAWINGS_ATM_CURRENT    float64
 16  CNT_DRAWINGS_CURRENT        int64
 17  CNT_DRAWINGS_OTHER_CURRENT  float64
 18  CNT_DRAWINGS_POS_CURRENT    float64
 19  CNT_INSTALMENT_MATURE_CUM   float64
 20  NAME_CONTRACT_STATUS        object
 21  SK_DPD                      int64
 22  SK_DPD_DEF                  int64
dtypes: float64(15), int64(7), object(1)
memory usage: 673.9+ MB
None

| | SK_ID_PREV | SK_ID_CURR | MONTHS_BALANCE | AMT_BALANCE | AMT_CREDIT_LIMIT_ACTUAL | AMT_DRAWINGS_ATM_CURRENT | AMT_DRAWINGS_CURRENT | AMT_DRAWINGS_OTHER_CURRENT | AMT_DRAWINGS_POS_CURRENT | AMT_INST_MIN_REGULARITY | ... | AMT_RECIVABLE | AMT_TOTAL_RECEIVABLE | CNT_DRAWINGS_ATM_CURRENT | CNT_DRAWINGS_CURRENT | CNT_DRAWINGS_OTHER_CURRENT | CNT_DRAWINGS_POS_CURRENT | CNT_INSTALMENT_MATURE_CUM | NAME_CONTRACT_STATUS | SK_DPD | SK_DPD_DEF |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2562384 | 378907 | -6 | 56.970 | 135000 | 0.0 | 877.5 | 0.0 | 877.5 | 1700.325 | ... | 0.000 | 0.000 | 0.0 | 1 | 0.0 | 1.0 | 35.0 | Active | 0 | 0 |
| 1 | 2582071 | 363914 | -1 | 63975.555 | 45000 | 2250.0 | 2250.0 | 0.0 | 0.0 | 2250.000 | ... | 64875.555 | 64875.555 | 1.0 | 1 | 0.0 | 0.0 | 69.0 | Active | 0 | 0 |
| 2 | 1740877 | 371185 | -7 | 31815.225 | 450000 | 0.0 | 0.0 | 0.0 | 0.0 | 2250.000 | ... | 31460.085 | 31460.085 | 0.0 | 0 | 0.0 | 0.0 | 30.0 | Active | 0 | 0 |
| 3 | 1389973 | 337855 | -4 | 236572.110 | 225000 | 2250.0 | 2250.0 | 0.0 | 0.0 | 11795.760 | ... | 233048.970 | 233048.970 | 1.0 | 1 | 0.0 | 0.0 | 10.0 | Active | 0 | 0 |
| 4 | 1891521 | 126868 | -1 | 453919.455 | 450000 | 0.0 | 11547.0 | 0.0 | 11547.0 | 22924.890 | ... | 453919.455 | 453919.455 | 0.0 | 1 | 0.0 | 1.0 | 101.0 | Active | 0 | 0 |
5 rows × 23 columns
installments_payments: shape is (13605401, 8)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13605401 entries, 0 to 13605400
Data columns (total 8 columns):
 #   Column                  Dtype
---  ------                  -----
 0   SK_ID_PREV              int64
 1   SK_ID_CURR              int64
 2   NUM_INSTALMENT_VERSION  float64
 3   NUM_INSTALMENT_NUMBER   int64
 4   DAYS_INSTALMENT         float64
 5   DAYS_ENTRY_PAYMENT      float64
 6   AMT_INSTALMENT          float64
 7   AMT_PAYMENT             float64
dtypes: float64(5), int64(3)
memory usage: 830.4 MB
None

| | SK_ID_PREV | SK_ID_CURR | NUM_INSTALMENT_VERSION | NUM_INSTALMENT_NUMBER | DAYS_INSTALMENT | DAYS_ENTRY_PAYMENT | AMT_INSTALMENT | AMT_PAYMENT |
|---|---|---|---|---|---|---|---|---|
| 0 | 1054186 | 161674 | 1.0 | 6 | -1180.0 | -1187.0 | 6948.360 | 6948.360 |
| 1 | 1330831 | 151639 | 0.0 | 34 | -2156.0 | -2156.0 | 1716.525 | 1716.525 |
| 2 | 2085231 | 193053 | 2.0 | 1 | -63.0 | -63.0 | 25425.000 | 25425.000 |
| 3 | 2452527 | 199697 | 1.0 | 3 | -2418.0 | -2426.0 | 24350.130 | 24350.130 |
| 4 | 2714724 | 167756 | 1.0 | 2 | -1383.0 | -1366.0 | 2165.040 | 2160.585 |
previous_application: shape is (1670214, 37)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1670214 entries, 0 to 1670213
Data columns (total 37 columns):
 #   Column                       Non-Null Count    Dtype
---  ------                       --------------    -----
 0   SK_ID_PREV                   1670214 non-null  int64
 1   SK_ID_CURR                   1670214 non-null  int64
 2   NAME_CONTRACT_TYPE           1670214 non-null  object
 3   AMT_ANNUITY                  1297979 non-null  float64
 4   AMT_APPLICATION              1670214 non-null  float64
 5   AMT_CREDIT                   1670213 non-null  float64
 6   AMT_DOWN_PAYMENT             774370 non-null   float64
 7   AMT_GOODS_PRICE              1284699 non-null  float64
 8   WEEKDAY_APPR_PROCESS_START   1670214 non-null  object
 9   HOUR_APPR_PROCESS_START      1670214 non-null  int64
 10  FLAG_LAST_APPL_PER_CONTRACT  1670214 non-null  object
 11  NFLAG_LAST_APPL_IN_DAY       1670214 non-null  int64
 12  RATE_DOWN_PAYMENT            774370 non-null   float64
 13  RATE_INTEREST_PRIMARY        5951 non-null     float64
 14  RATE_INTEREST_PRIVILEGED     5951 non-null     float64
 15  NAME_CASH_LOAN_PURPOSE       1670214 non-null  object
 16  NAME_CONTRACT_STATUS         1670214 non-null  object
 17  DAYS_DECISION                1670214 non-null  int64
 18  NAME_PAYMENT_TYPE            1670214 non-null  object
 19  CODE_REJECT_REASON           1670214 non-null  object
 20  NAME_TYPE_SUITE              849809 non-null   object
 21  NAME_CLIENT_TYPE             1670214 non-null  object
 22  NAME_GOODS_CATEGORY          1670214 non-null  object
 23  NAME_PORTFOLIO               1670214 non-null  object
 24  NAME_PRODUCT_TYPE            1670214 non-null  object
 25  CHANNEL_TYPE                 1670214 non-null  object
 26  SELLERPLACE_AREA             1670214 non-null  int64
 27  NAME_SELLER_INDUSTRY         1670214 non-null  object
 28  CNT_PAYMENT                  1297984 non-null  float64
 29  NAME_YIELD_GROUP             1670214 non-null  object
 30  PRODUCT_COMBINATION          1669868 non-null  object
 31  DAYS_FIRST_DRAWING           997149 non-null   float64
 32  DAYS_FIRST_DUE               997149 non-null   float64
 33  DAYS_LAST_DUE_1ST_VERSION    997149 non-null   float64
 34  DAYS_LAST_DUE                997149 non-null   float64
 35  DAYS_TERMINATION             997149 non-null   float64
 36  NFLAG_INSURED_ON_APPROVAL    997149 non-null   float64
dtypes: float64(15), int64(6), object(16)
memory usage: 471.5+ MB
None

| | SK_ID_PREV | SK_ID_CURR | NAME_CONTRACT_TYPE | AMT_ANNUITY | AMT_APPLICATION | AMT_CREDIT | AMT_DOWN_PAYMENT | AMT_GOODS_PRICE | WEEKDAY_APPR_PROCESS_START | HOUR_APPR_PROCESS_START | ... | NAME_SELLER_INDUSTRY | CNT_PAYMENT | NAME_YIELD_GROUP | PRODUCT_COMBINATION | DAYS_FIRST_DRAWING | DAYS_FIRST_DUE | DAYS_LAST_DUE_1ST_VERSION | DAYS_LAST_DUE | DAYS_TERMINATION | NFLAG_INSURED_ON_APPROVAL |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2030495 | 271877 | Consumer loans | 1730.430 | 17145.0 | 17145.0 | 0.0 | 17145.0 | SATURDAY | 15 | ... | Connectivity | 12.0 | middle | POS mobile with interest | 365243.0 | -42.0 | 300.0 | -42.0 | -37.0 | 0.0 |
| 1 | 2802425 | 108129 | Cash loans | 25188.615 | 607500.0 | 679671.0 | NaN | 607500.0 | THURSDAY | 11 | ... | XNA | 36.0 | low_action | Cash X-Sell: low | 365243.0 | -134.0 | 916.0 | 365243.0 | 365243.0 | 1.0 |
| 2 | 2523466 | 122040 | Cash loans | 15060.735 | 112500.0 | 136444.5 | NaN | 112500.0 | TUESDAY | 11 | ... | XNA | 12.0 | high | Cash X-Sell: high | 365243.0 | -271.0 | 59.0 | 365243.0 | 365243.0 | 1.0 |
| 3 | 2819243 | 176158 | Cash loans | 47041.335 | 450000.0 | 470790.0 | NaN | 450000.0 | MONDAY | 7 | ... | XNA | 12.0 | middle | Cash X-Sell: middle | 365243.0 | -482.0 | -152.0 | -182.0 | -177.0 | 1.0 |
| 4 | 1784265 | 202054 | Cash loans | 31924.395 | 337500.0 | 404055.0 | NaN | 337500.0 | THURSDAY | 9 | ... | XNA | 24.0 | high | Cash Street: high | NaN | NaN | NaN | NaN | NaN | NaN |
5 rows × 37 columns
POS_CASH_balance: shape is (10001358, 8)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10001358 entries, 0 to 10001357
Data columns (total 8 columns):
 #   Column                 Dtype
---  ------                 -----
 0   SK_ID_PREV             int64
 1   SK_ID_CURR             int64
 2   MONTHS_BALANCE         int64
 3   CNT_INSTALMENT         float64
 4   CNT_INSTALMENT_FUTURE  float64
 5   NAME_CONTRACT_STATUS   object
 6   SK_DPD                 int64
 7   SK_DPD_DEF             int64
dtypes: float64(2), int64(5), object(1)
memory usage: 610.4+ MB
None

| | SK_ID_PREV | SK_ID_CURR | MONTHS_BALANCE | CNT_INSTALMENT | CNT_INSTALMENT_FUTURE | NAME_CONTRACT_STATUS | SK_DPD | SK_DPD_DEF |
|---|---|---|---|---|---|---|---|---|
| 0 | 1803195 | 182943 | -31 | 48.0 | 45.0 | Active | 0 | 0 |
| 1 | 1715348 | 367990 | -33 | 36.0 | 35.0 | Active | 0 | 0 |
| 2 | 1784872 | 397406 | -32 | 12.0 | 9.0 | Active | 0 | 0 |
| 3 | 1903291 | 269225 | -35 | 48.0 | 42.0 | Active | 0 | 0 |
| 4 | 2341044 | 334279 | -35 | 36.0 | 35.0 | Active | 0 | 0 |
CPU times: user 35.3 s, sys: 21.5 s, total: 56.7 s
Wall time: 1min 22s
for ds_name in datasets.keys():
    print(f'dataset {ds_name:24}: [ {datasets[ds_name].shape[0]:10,}, {datasets[ds_name].shape[1]}]')
dataset application_train : [ 307,511, 122]
dataset application_test : [ 48,744, 121]
dataset bureau : [ 1,716,428, 17]
dataset bureau_balance : [ 27,299,925, 3]
dataset credit_card_balance : [ 3,840,312, 23]
dataset installments_payments : [ 13,605,401, 8]
dataset previous_application : [ 1,670,214, 37]
dataset POS_CASH_balance : [ 10,001,358, 8]
print('\033[1m' + "Size of each dataset : " + '\033[0m' , end = '\n' * 2)
for ds_name in datasets.keys():
    print(f'dataset {ds_name:24}: [ {datasets[ds_name].shape[0]:10,}, {datasets[ds_name].shape[1]:4}]')
Size of each dataset :
dataset application_train : [ 307,511, 122]
dataset application_test : [ 48,744, 121]
dataset bureau : [ 1,716,428, 17]
dataset bureau_balance : [ 27,299,925, 3]
dataset credit_card_balance : [ 3,840,312, 23]
dataset installments_payments : [ 13,605,401, 8]
dataset previous_application : [ 1,670,214, 37]
dataset POS_CASH_balance : [ 10,001,358, 8]
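The size listing above follows load order. A minimal sketch that collects the same numbers and sorts them by row count, which makes it easier to see which tables will be the most expensive to process. The helper names `shapes_by_size` and `print_shapes` are hypothetical; the only assumption is that each value in `datasets` exposes a `.shape` attribute, as a pandas DataFrame does.

```python
def shapes_by_size(datasets):
    """Return (name, n_rows, n_cols) tuples, largest table first.

    `datasets` is assumed to map a name to an object with a `.shape`
    attribute (such as a pandas DataFrame); the helper name is hypothetical.
    """
    rows = [(name, df.shape[0], df.shape[1]) for name, df in datasets.items()]
    return sorted(rows, key=lambda row: row[1], reverse=True)


def print_shapes(datasets):
    """Print the sorted sizes in the same layout as the cell above."""
    for name, n_rows, n_cols in shapes_by_size(datasets):
        print(f"dataset {name:24}: [ {n_rows:10,}, {n_cols:4}]")
```

Calling `print_shapes(datasets)` on the dictionary built above would list `bureau_balance` first, since it has the most rows.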
datasets["application_train"].info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 307511 entries, 0 to 307510
Columns: 122 entries, SK_ID_CURR to AMT_REQ_CREDIT_BUREAU_YEAR
dtypes: float64(65), int64(41), object(16)
memory usage: 286.2+ MB
datasets["application_train"].describe() #numerical only features
| | SK_ID_CURR | TARGET | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | AMT_GOODS_PRICE | REGION_POPULATION_RELATIVE | DAYS_BIRTH | DAYS_EMPLOYED | ... | FLAG_DOCUMENT_18 | FLAG_DOCUMENT_19 | FLAG_DOCUMENT_20 | FLAG_DOCUMENT_21 | AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 307511.000000 | 307511.000000 | 307511.000000 | 3.075110e+05 | 3.075110e+05 | 307499.000000 | 3.072330e+05 | 307511.000000 | 307511.000000 | 307511.000000 | ... | 307511.000000 | 307511.000000 | 307511.000000 | 307511.000000 | 265992.000000 | 265992.000000 | 265992.000000 | 265992.000000 | 265992.000000 | 265992.000000 |
| mean | 278180.518577 | 0.080729 | 0.417052 | 1.687979e+05 | 5.990260e+05 | 27108.573909 | 5.383962e+05 | 0.020868 | -16036.995067 | 63815.045904 | ... | 0.008130 | 0.000595 | 0.000507 | 0.000335 | 0.006402 | 0.007000 | 0.034362 | 0.267395 | 0.265474 | 1.899974 |
| std | 102790.175348 | 0.272419 | 0.722121 | 2.371231e+05 | 4.024908e+05 | 14493.737315 | 3.694465e+05 | 0.013831 | 4363.988632 | 141275.766519 | ... | 0.089798 | 0.024387 | 0.022518 | 0.018299 | 0.083849 | 0.110757 | 0.204685 | 0.916002 | 0.794056 | 1.869295 |
| min | 100002.000000 | 0.000000 | 0.000000 | 2.565000e+04 | 4.500000e+04 | 1615.500000 | 4.050000e+04 | 0.000290 | -25229.000000 | -17912.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 189145.500000 | 0.000000 | 0.000000 | 1.125000e+05 | 2.700000e+05 | 16524.000000 | 2.385000e+05 | 0.010006 | -19682.000000 | -2760.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 50% | 278202.000000 | 0.000000 | 0.000000 | 1.471500e+05 | 5.135310e+05 | 24903.000000 | 4.500000e+05 | 0.018850 | -15750.000000 | -1213.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 |
| 75% | 367142.500000 | 0.000000 | 1.000000 | 2.025000e+05 | 8.086500e+05 | 34596.000000 | 6.795000e+05 | 0.028663 | -12413.000000 | -289.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 3.000000 |
| max | 456255.000000 | 1.000000 | 19.000000 | 1.170000e+08 | 4.050000e+06 | 258025.500000 | 4.050000e+06 | 0.072508 | -7489.000000 | 365243.000000 | ... | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 4.000000 | 9.000000 | 8.000000 | 27.000000 | 261.000000 | 25.000000 |
8 rows × 106 columns
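The `count` row above is smaller than the number of rows for some columns (for example 307,499 for AMT_ANNUITY against 307,511 rows), which means those columns contain missing values. A minimal sketch for listing them, assuming a pandas DataFrame as input; `missing_summary` is a hypothetical helper name, not something defined elsewhere in this notebook.

```python
import pandas as pd


def missing_summary(df):
    """Count and percentage of missing values per column, worst first.

    Hypothetical helper; columns without missing values are dropped.
    """
    n_missing = df.isna().sum()
    summary = pd.DataFrame({
        "n_missing": n_missing,
        "pct_missing": (100 * n_missing / len(df)).round(2),
    })
    summary = summary[summary["n_missing"] > 0]
    return summary.sort_values("n_missing", ascending=False)
```

For the training table, the call would be `missing_summary(datasets["application_train"])`.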
datasets["application_test"].describe() #numerical only features
| | SK_ID_CURR | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | AMT_GOODS_PRICE | REGION_POPULATION_RELATIVE | DAYS_BIRTH | DAYS_EMPLOYED | DAYS_REGISTRATION | ... | FLAG_DOCUMENT_18 | FLAG_DOCUMENT_19 | FLAG_DOCUMENT_20 | FLAG_DOCUMENT_21 | AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 48744.000000 | 48744.000000 | 4.874400e+04 | 4.874400e+04 | 48720.000000 | 4.874400e+04 | 48744.000000 | 48744.000000 | 48744.000000 | 48744.000000 | ... | 48744.000000 | 48744.0 | 48744.0 | 48744.0 | 42695.000000 | 42695.000000 | 42695.000000 | 42695.000000 | 42695.000000 | 42695.000000 |
| mean | 277796.676350 | 0.397054 | 1.784318e+05 | 5.167404e+05 | 29426.240209 | 4.626188e+05 | 0.021226 | -16068.084605 | 67485.366322 | -4967.652716 | ... | 0.001559 | 0.0 | 0.0 | 0.0 | 0.002108 | 0.001803 | 0.002787 | 0.009299 | 0.546902 | 1.983769 |
| std | 103169.547296 | 0.709047 | 1.015226e+05 | 3.653970e+05 | 16016.368315 | 3.367102e+05 | 0.014428 | 4325.900393 | 144348.507136 | 3552.612035 | ... | 0.039456 | 0.0 | 0.0 | 0.0 | 0.046373 | 0.046132 | 0.054037 | 0.110924 | 0.693305 | 1.838873 |
| min | 100001.000000 | 0.000000 | 2.694150e+04 | 4.500000e+04 | 2295.000000 | 4.500000e+04 | 0.000253 | -25195.000000 | -17463.000000 | -23722.000000 | ... | 0.000000 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 188557.750000 | 0.000000 | 1.125000e+05 | 2.606400e+05 | 17973.000000 | 2.250000e+05 | 0.010006 | -19637.000000 | -2910.000000 | -7459.250000 | ... | 0.000000 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 50% | 277549.000000 | 0.000000 | 1.575000e+05 | 4.500000e+05 | 26199.000000 | 3.960000e+05 | 0.018850 | -15785.000000 | -1293.000000 | -4490.000000 | ... | 0.000000 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 2.000000 |
| 75% | 367555.500000 | 1.000000 | 2.250000e+05 | 6.750000e+05 | 37390.500000 | 6.300000e+05 | 0.028663 | -12496.000000 | -296.000000 | -1901.000000 | ... | 0.000000 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 3.000000 |
| max | 456250.000000 | 20.000000 | 4.410000e+06 | 2.245500e+06 | 180576.000000 | 2.245500e+06 | 0.072508 | -7338.000000 | 365243.000000 | 0.000000 | ... | 1.000000 | 0.0 | 0.0 | 0.0 | 2.000000 | 2.000000 | 2.000000 | 6.000000 | 7.000000 | 17.000000 |
8 rows × 105 columns
datasets["application_train"].describe(include='all') #look at all categorical and numerical
| | SK_ID_CURR | TARGET | NAME_CONTRACT_TYPE | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | ... | FLAG_DOCUMENT_18 | FLAG_DOCUMENT_19 | FLAG_DOCUMENT_20 | FLAG_DOCUMENT_21 | AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 307511.000000 | 307511.000000 | 307511 | 307511 | 307511 | 307511 | 307511.000000 | 3.075110e+05 | 3.075110e+05 | 307499.000000 | ... | 307511.000000 | 307511.000000 | 307511.000000 | 307511.000000 | 265992.000000 | 265992.000000 | 265992.000000 | 265992.000000 | 265992.000000 | 265992.000000 |
| unique | NaN | NaN | 2 | 3 | 2 | 2 | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| top | NaN | NaN | Cash loans | F | N | Y | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| freq | NaN | NaN | 278232 | 202448 | 202924 | 213312 | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| mean | 278180.518577 | 0.080729 | NaN | NaN | NaN | NaN | 0.417052 | 1.687979e+05 | 5.990260e+05 | 27108.573909 | ... | 0.008130 | 0.000595 | 0.000507 | 0.000335 | 0.006402 | 0.007000 | 0.034362 | 0.267395 | 0.265474 | 1.899974 |
| std | 102790.175348 | 0.272419 | NaN | NaN | NaN | NaN | 0.722121 | 2.371231e+05 | 4.024908e+05 | 14493.737315 | ... | 0.089798 | 0.024387 | 0.022518 | 0.018299 | 0.083849 | 0.110757 | 0.204685 | 0.916002 | 0.794056 | 1.869295 |
| min | 100002.000000 | 0.000000 | NaN | NaN | NaN | NaN | 0.000000 | 2.565000e+04 | 4.500000e+04 | 1615.500000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 189145.500000 | 0.000000 | NaN | NaN | NaN | NaN | 0.000000 | 1.125000e+05 | 2.700000e+05 | 16524.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 50% | 278202.000000 | 0.000000 | NaN | NaN | NaN | NaN | 0.000000 | 1.471500e+05 | 5.135310e+05 | 24903.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 |
| 75% | 367142.500000 | 0.000000 | NaN | NaN | NaN | NaN | 1.000000 | 2.025000e+05 | 8.086500e+05 | 34596.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 3.000000 |
| max | 456255.000000 | 1.000000 | NaN | NaN | NaN | NaN | 19.000000 | 1.170000e+08 | 4.050000e+06 | 258025.500000 | ... | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 4.000000 | 9.000000 | 8.000000 | 27.000000 | 261.000000 | 25.000000 |
11 rows × 122 columns
from IPython.display import display, HTML
pd.set_option("display.max_rows", None, "display.max_columns", None)

# Full stats
def stats_summary1(df, df_name):
    # info() and display() print directly and return None, so wrapping them
    # in print() also emits a trailing "None" in the cell output.
    # Note: null_counts was renamed to show_counts in pandas 1.2 and removed in 2.0.
    print(df.info(verbose=True, null_counts=True))
    print("-----"*15)
    print(f"Shape of the df {df_name} is {df.shape} \n")
    print("-----"*15)
    print(f"Statistical summary of {df_name} is :")
    print("-----"*15)
    print(f"Description of the df {df_name}:\n")
    # Describe the frame that was passed in, not a hard-coded dataset.
    print(display(HTML(np.round(df.describe(), 2).to_html())))
    #print(f"Description of the df {df_name}:\n",np.round(df.describe(),2))

def stats_summary2(df, df_name):
    print(f"Description of the df continued for {df_name}:\n")
    print("-----"*15)
    print("Data type value counts: \n", df.dtypes.value_counts())
    print("\nReturn number of unique elements in the object. \n")
    print(df.select_dtypes('object').apply(pd.Series.nunique, axis=0))

# List the categorical and numerical features of a DataFrame
def feature_datatypes_groups(df, df_name):
    df_dtypes = df.columns.to_series().groupby(df.dtypes).groups
    print("-----"*15)
    print(f"Categorical and Numerical(int + float) features of {df_name}.")
    print("-----"*15)
    print()
    for k, v in df_dtypes.items():
        print({k.name: v})
        print("---"*10)
    print("\n \n")

# Null data list and plot.
def null_data_plot(df, df_name):
    percent = (df.isnull().sum()/df.isnull().count()*100).sort_values(ascending=False).round(2)
    sum_missing = df.isna().sum().sort_values(ascending=False)
    missing_data = pd.concat([percent, sum_missing], axis=1, keys=['Percent', "Train Missing Count"])
    # Keep only the columns that actually have missing values.
    missing_data = missing_data[missing_data['Percent'] > 0]
    print("-----"*15)
    print("-----"*15)
    print('\n The Missing Data: \n')
    # display(missing_data) # display few
    if len(missing_data) == 0:
        print("No missing Data")
    else:
        display(HTML(missing_data.to_html()))  # display all the rows
        print("-----"*15)
        # Use a taller figure for wide tables so the feature labels stay readable.
        if len(df.columns) > 35:
            f, ax = plt.subplots(figsize=(8, 15))
        else:
            f, ax = plt.subplots()
        #plt.xticks(rotation='90')
        #fig=sns.barplot(missing_data.index, missing_data["Percent"],alpha=0.8)
        #plt.xlabel('Features', fontsize=15)
        #plt.ylabel('Percent of missing values', fontsize=15)
        plt.title(f'Percent missing data for {df_name}.', fontsize=10)
        # Keyword arguments avoid the positional-argument deprecation in newer seaborn.
        fig = sns.barplot(x=missing_data["Percent"], y=missing_data.index, alpha=0.8)
        plt.xlabel('Percent of missing values', fontsize=10)
        plt.ylabel('Features', fontsize=10)
        plt.show()
    return missing_data

# Full consolidation of all the stats functions.
def display_stats(df, df_name):
    print("--"*40)
    print(" "*20 + '\033[1m' + df_name + '\033[0m' + " "*20)
    print("--"*40)
    stats_summary1(df, df_name)

def display_feature_info(df, df_name):
    stats_summary2(df, df_name)
    feature_datatypes_groups(df, df_name)
    null_data_plot(df, df_name)
display_stats(datasets['application_train'], 'application_train')
--------------------------------------------------------------------------------
application_train
--------------------------------------------------------------------------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 307511 entries, 0 to 307510
Data columns (total 122 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 SK_ID_CURR 307511 non-null int64
1 TARGET 307511 non-null int64
2 NAME_CONTRACT_TYPE 307511 non-null object
3 CODE_GENDER 307511 non-null object
4 FLAG_OWN_CAR 307511 non-null object
5 FLAG_OWN_REALTY 307511 non-null object
6 CNT_CHILDREN 307511 non-null int64
7 AMT_INCOME_TOTAL 307511 non-null float64
8 AMT_CREDIT 307511 non-null float64
9 AMT_ANNUITY 307499 non-null float64
10 AMT_GOODS_PRICE 307233 non-null float64
11 NAME_TYPE_SUITE 306219 non-null object
12 NAME_INCOME_TYPE 307511 non-null object
13 NAME_EDUCATION_TYPE 307511 non-null object
14 NAME_FAMILY_STATUS 307511 non-null object
15 NAME_HOUSING_TYPE 307511 non-null object
16 REGION_POPULATION_RELATIVE 307511 non-null float64
17 DAYS_BIRTH 307511 non-null int64
18 DAYS_EMPLOYED 307511 non-null int64
19 DAYS_REGISTRATION 307511 non-null float64
20 DAYS_ID_PUBLISH 307511 non-null int64
21 OWN_CAR_AGE 104582 non-null float64
22 FLAG_MOBIL 307511 non-null int64
23 FLAG_EMP_PHONE 307511 non-null int64
24 FLAG_WORK_PHONE 307511 non-null int64
25 FLAG_CONT_MOBILE 307511 non-null int64
26 FLAG_PHONE 307511 non-null int64
27 FLAG_EMAIL 307511 non-null int64
28 OCCUPATION_TYPE 211120 non-null object
29 CNT_FAM_MEMBERS 307509 non-null float64
30 REGION_RATING_CLIENT 307511 non-null int64
31 REGION_RATING_CLIENT_W_CITY 307511 non-null int64
32 WEEKDAY_APPR_PROCESS_START 307511 non-null object
33 HOUR_APPR_PROCESS_START 307511 non-null int64
34 REG_REGION_NOT_LIVE_REGION 307511 non-null int64
35 REG_REGION_NOT_WORK_REGION 307511 non-null int64
36 LIVE_REGION_NOT_WORK_REGION 307511 non-null int64
37 REG_CITY_NOT_LIVE_CITY 307511 non-null int64
38 REG_CITY_NOT_WORK_CITY 307511 non-null int64
39 LIVE_CITY_NOT_WORK_CITY 307511 non-null int64
40 ORGANIZATION_TYPE 307511 non-null object
41 EXT_SOURCE_1 134133 non-null float64
42 EXT_SOURCE_2 306851 non-null float64
43 EXT_SOURCE_3 246546 non-null float64
44 APARTMENTS_AVG 151450 non-null float64
45 BASEMENTAREA_AVG 127568 non-null float64
46 YEARS_BEGINEXPLUATATION_AVG 157504 non-null float64
47 YEARS_BUILD_AVG 103023 non-null float64
48 COMMONAREA_AVG 92646 non-null float64
49 ELEVATORS_AVG 143620 non-null float64
50 ENTRANCES_AVG 152683 non-null float64
51 FLOORSMAX_AVG 154491 non-null float64
52 FLOORSMIN_AVG 98869 non-null float64
53 LANDAREA_AVG 124921 non-null float64
54 LIVINGAPARTMENTS_AVG 97312 non-null float64
55 LIVINGAREA_AVG 153161 non-null float64
56 NONLIVINGAPARTMENTS_AVG 93997 non-null float64
57 NONLIVINGAREA_AVG 137829 non-null float64
58 APARTMENTS_MODE 151450 non-null float64
59 BASEMENTAREA_MODE 127568 non-null float64
60 YEARS_BEGINEXPLUATATION_MODE 157504 non-null float64
61 YEARS_BUILD_MODE 103023 non-null float64
62 COMMONAREA_MODE 92646 non-null float64
63 ELEVATORS_MODE 143620 non-null float64
64 ENTRANCES_MODE 152683 non-null float64
65 FLOORSMAX_MODE 154491 non-null float64
66 FLOORSMIN_MODE 98869 non-null float64
67 LANDAREA_MODE 124921 non-null float64
68 LIVINGAPARTMENTS_MODE 97312 non-null float64
69 LIVINGAREA_MODE 153161 non-null float64
70 NONLIVINGAPARTMENTS_MODE 93997 non-null float64
71 NONLIVINGAREA_MODE 137829 non-null float64
72 APARTMENTS_MEDI 151450 non-null float64
73 BASEMENTAREA_MEDI 127568 non-null float64
74 YEARS_BEGINEXPLUATATION_MEDI 157504 non-null float64
75 YEARS_BUILD_MEDI 103023 non-null float64
76 COMMONAREA_MEDI 92646 non-null float64
77 ELEVATORS_MEDI 143620 non-null float64
78 ENTRANCES_MEDI 152683 non-null float64
79 FLOORSMAX_MEDI 154491 non-null float64
80 FLOORSMIN_MEDI 98869 non-null float64
81 LANDAREA_MEDI 124921 non-null float64
82 LIVINGAPARTMENTS_MEDI 97312 non-null float64
83 LIVINGAREA_MEDI 153161 non-null float64
84 NONLIVINGAPARTMENTS_MEDI 93997 non-null float64
85 NONLIVINGAREA_MEDI 137829 non-null float64
86 FONDKAPREMONT_MODE 97216 non-null object
87 HOUSETYPE_MODE 153214 non-null object
88 TOTALAREA_MODE 159080 non-null float64
89 WALLSMATERIAL_MODE 151170 non-null object
90 EMERGENCYSTATE_MODE 161756 non-null object
91 OBS_30_CNT_SOCIAL_CIRCLE 306490 non-null float64
92 DEF_30_CNT_SOCIAL_CIRCLE 306490 non-null float64
93 OBS_60_CNT_SOCIAL_CIRCLE 306490 non-null float64
94 DEF_60_CNT_SOCIAL_CIRCLE 306490 non-null float64
95 DAYS_LAST_PHONE_CHANGE 307510 non-null float64
96 FLAG_DOCUMENT_2 307511 non-null int64
97 FLAG_DOCUMENT_3 307511 non-null int64
98 FLAG_DOCUMENT_4 307511 non-null int64
99 FLAG_DOCUMENT_5 307511 non-null int64
100 FLAG_DOCUMENT_6 307511 non-null int64
101 FLAG_DOCUMENT_7 307511 non-null int64
102 FLAG_DOCUMENT_8 307511 non-null int64
103 FLAG_DOCUMENT_9 307511 non-null int64
104 FLAG_DOCUMENT_10 307511 non-null int64
105 FLAG_DOCUMENT_11 307511 non-null int64
106 FLAG_DOCUMENT_12 307511 non-null int64
107 FLAG_DOCUMENT_13 307511 non-null int64
108 FLAG_DOCUMENT_14 307511 non-null int64
109 FLAG_DOCUMENT_15 307511 non-null int64
110 FLAG_DOCUMENT_16 307511 non-null int64
111 FLAG_DOCUMENT_17 307511 non-null int64
112 FLAG_DOCUMENT_18 307511 non-null int64
113 FLAG_DOCUMENT_19 307511 non-null int64
114 FLAG_DOCUMENT_20 307511 non-null int64
115 FLAG_DOCUMENT_21 307511 non-null int64
116 AMT_REQ_CREDIT_BUREAU_HOUR 265992 non-null float64
117 AMT_REQ_CREDIT_BUREAU_DAY 265992 non-null float64
118 AMT_REQ_CREDIT_BUREAU_WEEK 265992 non-null float64
119 AMT_REQ_CREDIT_BUREAU_MON 265992 non-null float64
120 AMT_REQ_CREDIT_BUREAU_QRT 265992 non-null float64
121 AMT_REQ_CREDIT_BUREAU_YEAR 265992 non-null float64
dtypes: float64(65), int64(41), object(16)
memory usage: 286.2+ MB
None
---------------------------------------------------------------------------
Shape of the df application_train is (307511, 122)
---------------------------------------------------------------------------
Statistical summary of application_train is :
---------------------------------------------------------------------------
Description of the df application_train:
| | SK_ID_CURR | TARGET | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | AMT_GOODS_PRICE | REGION_POPULATION_RELATIVE | DAYS_BIRTH | DAYS_EMPLOYED | DAYS_REGISTRATION | DAYS_ID_PUBLISH | OWN_CAR_AGE | FLAG_MOBIL | FLAG_EMP_PHONE | FLAG_WORK_PHONE | FLAG_CONT_MOBILE | FLAG_PHONE | FLAG_EMAIL | CNT_FAM_MEMBERS | REGION_RATING_CLIENT | REGION_RATING_CLIENT_W_CITY | HOUR_APPR_PROCESS_START | REG_REGION_NOT_LIVE_REGION | REG_REGION_NOT_WORK_REGION | LIVE_REGION_NOT_WORK_REGION | REG_CITY_NOT_LIVE_CITY | REG_CITY_NOT_WORK_CITY | LIVE_CITY_NOT_WORK_CITY | EXT_SOURCE_1 | EXT_SOURCE_2 | EXT_SOURCE_3 | APARTMENTS_AVG | BASEMENTAREA_AVG | YEARS_BEGINEXPLUATATION_AVG | YEARS_BUILD_AVG | COMMONAREA_AVG | ELEVATORS_AVG | ENTRANCES_AVG | FLOORSMAX_AVG | FLOORSMIN_AVG | LANDAREA_AVG | LIVINGAPARTMENTS_AVG | LIVINGAREA_AVG | NONLIVINGAPARTMENTS_AVG | NONLIVINGAREA_AVG | APARTMENTS_MODE | BASEMENTAREA_MODE | YEARS_BEGINEXPLUATATION_MODE | YEARS_BUILD_MODE | COMMONAREA_MODE | ELEVATORS_MODE | ENTRANCES_MODE | FLOORSMAX_MODE | FLOORSMIN_MODE | LANDAREA_MODE | LIVINGAPARTMENTS_MODE | LIVINGAREA_MODE | NONLIVINGAPARTMENTS_MODE | NONLIVINGAREA_MODE | APARTMENTS_MEDI | BASEMENTAREA_MEDI | YEARS_BEGINEXPLUATATION_MEDI | YEARS_BUILD_MEDI | COMMONAREA_MEDI | ELEVATORS_MEDI | ENTRANCES_MEDI | FLOORSMAX_MEDI | FLOORSMIN_MEDI | LANDAREA_MEDI | LIVINGAPARTMENTS_MEDI | LIVINGAREA_MEDI | NONLIVINGAPARTMENTS_MEDI | NONLIVINGAREA_MEDI | TOTALAREA_MODE | OBS_30_CNT_SOCIAL_CIRCLE | DEF_30_CNT_SOCIAL_CIRCLE | OBS_60_CNT_SOCIAL_CIRCLE | DEF_60_CNT_SOCIAL_CIRCLE | DAYS_LAST_PHONE_CHANGE | FLAG_DOCUMENT_2 | FLAG_DOCUMENT_3 | FLAG_DOCUMENT_4 | FLAG_DOCUMENT_5 | FLAG_DOCUMENT_6 | FLAG_DOCUMENT_7 | FLAG_DOCUMENT_8 | FLAG_DOCUMENT_9 | FLAG_DOCUMENT_10 | FLAG_DOCUMENT_11 | FLAG_DOCUMENT_12 | FLAG_DOCUMENT_13 | FLAG_DOCUMENT_14 | FLAG_DOCUMENT_15 | FLAG_DOCUMENT_16 | FLAG_DOCUMENT_17 | FLAG_DOCUMENT_18 | FLAG_DOCUMENT_19 | FLAG_DOCUMENT_20 | FLAG_DOCUMENT_21 | AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 307511.00 | 307511.00 | 307511.00 | 3.075110e+05 | 307511.00 | 307499.00 | 307233.00 | 307511.00 | 307511.00 | 307511.00 | 307511.00 | 307511.00 | 104582.00 | 307511.0 | 307511.00 | 307511.0 | 307511.00 | 307511.00 | 307511.00 | 307509.00 | 307511.00 | 307511.00 | 307511.00 | 307511.00 | 307511.00 | 307511.00 | 307511.00 | 307511.00 | 307511.00 | 134133.00 | 306851.00 | 246546.00 | 151450.00 | 127568.00 | 157504.00 | 103023.00 | 92646.00 | 143620.00 | 152683.00 | 154491.00 | 98869.00 | 124921.00 | 97312.00 | 153161.00 | 93997.00 | 137829.00 | 151450.00 | 127568.00 | 157504.00 | 103023.00 | 92646.00 | 143620.00 | 152683.00 | 154491.00 | 98869.00 | 124921.00 | 97312.00 | 153161.00 | 93997.00 | 137829.00 | 151450.00 | 127568.00 | 157504.00 | 103023.00 | 92646.00 | 143620.00 | 152683.00 | 154491.00 | 98869.00 | 124921.00 | 97312.00 | 153161.00 | 93997.00 | 137829.00 | 159080.00 | 306490.00 | 306490.00 | 306490.00 | 306490.00 | 307510.00 | 307511.00 | 307511.00 | 307511.00 | 307511.00 | 307511.00 | 307511.00 | 307511.00 | 307511.00 | 307511.0 | 307511.00 | 307511.0 | 307511.00 | 307511.00 | 307511.00 | 307511.00 | 307511.00 | 307511.00 | 307511.00 | 307511.00 | 307511.00 | 265992.00 | 265992.00 | 265992.00 | 265992.00 | 265992.00 | 265992.00 |
| mean | 278180.52 | 0.08 | 0.42 | 1.687979e+05 | 599026.00 | 27108.57 | 538396.21 | 0.02 | -16037.00 | 63815.05 | -4986.12 | -2994.20 | 12.06 | 1.0 | 0.82 | 0.2 | 1.00 | 0.28 | 0.06 | 2.15 | 2.05 | 2.03 | 12.06 | 0.02 | 0.05 | 0.04 | 0.08 | 0.23 | 0.18 | 0.50 | 0.51 | 0.51 | 0.12 | 0.09 | 0.98 | 0.75 | 0.04 | 0.08 | 0.15 | 0.23 | 0.23 | 0.07 | 0.10 | 0.11 | 0.01 | 0.03 | 0.11 | 0.09 | 0.98 | 0.76 | 0.04 | 0.07 | 0.15 | 0.22 | 0.23 | 0.06 | 0.11 | 0.11 | 0.01 | 0.03 | 0.12 | 0.09 | 0.98 | 0.76 | 0.04 | 0.08 | 0.15 | 0.23 | 0.23 | 0.07 | 0.10 | 0.11 | 0.01 | 0.03 | 0.10 | 1.42 | 0.14 | 1.41 | 0.10 | -962.86 | 0.00 | 0.71 | 0.00 | 0.02 | 0.09 | 0.00 | 0.08 | 0.00 | 0.0 | 0.00 | 0.0 | 0.00 | 0.00 | 0.00 | 0.01 | 0.00 | 0.01 | 0.00 | 0.00 | 0.00 | 0.01 | 0.01 | 0.03 | 0.27 | 0.27 | 1.90 |
| std | 102790.18 | 0.27 | 0.72 | 2.371231e+05 | 402490.78 | 14493.74 | 369446.46 | 0.01 | 4363.99 | 141275.77 | 3522.89 | 1509.45 | 11.94 | 0.0 | 0.38 | 0.4 | 0.04 | 0.45 | 0.23 | 0.91 | 0.51 | 0.50 | 3.27 | 0.12 | 0.22 | 0.20 | 0.27 | 0.42 | 0.38 | 0.21 | 0.19 | 0.19 | 0.11 | 0.08 | 0.06 | 0.11 | 0.08 | 0.13 | 0.10 | 0.14 | 0.16 | 0.08 | 0.09 | 0.11 | 0.05 | 0.07 | 0.11 | 0.08 | 0.06 | 0.11 | 0.07 | 0.13 | 0.10 | 0.14 | 0.16 | 0.08 | 0.10 | 0.11 | 0.05 | 0.07 | 0.11 | 0.08 | 0.06 | 0.11 | 0.08 | 0.13 | 0.10 | 0.15 | 0.16 | 0.08 | 0.09 | 0.11 | 0.05 | 0.07 | 0.11 | 2.40 | 0.45 | 2.38 | 0.36 | 826.81 | 0.01 | 0.45 | 0.01 | 0.12 | 0.28 | 0.01 | 0.27 | 0.06 | 0.0 | 0.06 | 0.0 | 0.06 | 0.05 | 0.03 | 0.10 | 0.02 | 0.09 | 0.02 | 0.02 | 0.02 | 0.08 | 0.11 | 0.20 | 0.92 | 0.79 | 1.87 |
| min | 100002.00 | 0.00 | 0.00 | 2.565000e+04 | 45000.00 | 1615.50 | 40500.00 | 0.00 | -25229.00 | -17912.00 | -24672.00 | -7197.00 | 0.00 | 0.0 | 0.00 | 0.0 | 0.00 | 0.00 | 0.00 | 1.00 | 1.00 | 1.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.01 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | -4292.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.0 | 0.00 | 0.0 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| 25% | 189145.50 | 0.00 | 0.00 | 1.125000e+05 | 270000.00 | 16524.00 | 238500.00 | 0.01 | -19682.00 | -2760.00 | -7479.50 | -4299.00 | 5.00 | 1.0 | 1.00 | 0.0 | 1.00 | 0.00 | 0.00 | 2.00 | 2.00 | 2.00 | 10.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.33 | 0.39 | 0.37 | 0.06 | 0.04 | 0.98 | 0.69 | 0.01 | 0.00 | 0.07 | 0.17 | 0.08 | 0.02 | 0.05 | 0.05 | 0.00 | 0.00 | 0.05 | 0.04 | 0.98 | 0.70 | 0.01 | 0.00 | 0.07 | 0.17 | 0.08 | 0.02 | 0.05 | 0.04 | 0.00 | 0.00 | 0.06 | 0.04 | 0.98 | 0.69 | 0.01 | 0.00 | 0.07 | 0.17 | 0.08 | 0.02 | 0.05 | 0.05 | 0.00 | 0.00 | 0.04 | 0.00 | 0.00 | 0.00 | 0.00 | -1570.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.0 | 0.00 | 0.0 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| 50% | 278202.00 | 0.00 | 0.00 | 1.471500e+05 | 513531.00 | 24903.00 | 450000.00 | 0.02 | -15750.00 | -1213.00 | -4504.00 | -3254.00 | 9.00 | 1.0 | 1.00 | 0.0 | 1.00 | 0.00 | 0.00 | 2.00 | 2.00 | 2.00 | 12.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.51 | 0.57 | 0.54 | 0.09 | 0.08 | 0.98 | 0.76 | 0.02 | 0.00 | 0.14 | 0.17 | 0.21 | 0.05 | 0.08 | 0.07 | 0.00 | 0.00 | 0.08 | 0.07 | 0.98 | 0.76 | 0.02 | 0.00 | 0.14 | 0.17 | 0.21 | 0.05 | 0.08 | 0.07 | 0.00 | 0.00 | 0.09 | 0.08 | 0.98 | 0.76 | 0.02 | 0.00 | 0.14 | 0.17 | 0.21 | 0.05 | 0.08 | 0.07 | 0.00 | 0.00 | 0.07 | 0.00 | 0.00 | 0.00 | 0.00 | -757.00 | 0.00 | 1.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.0 | 0.00 | 0.0 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 |
| 75% | 367142.50 | 0.00 | 1.00 | 2.025000e+05 | 808650.00 | 34596.00 | 679500.00 | 0.03 | -12413.00 | -289.00 | -2010.00 | -1720.00 | 15.00 | 1.0 | 1.00 | 0.0 | 1.00 | 1.00 | 0.00 | 3.00 | 2.00 | 2.00 | 14.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.68 | 0.66 | 0.67 | 0.15 | 0.11 | 0.99 | 0.82 | 0.05 | 0.12 | 0.21 | 0.33 | 0.38 | 0.09 | 0.12 | 0.13 | 0.00 | 0.03 | 0.14 | 0.11 | 0.99 | 0.82 | 0.05 | 0.12 | 0.21 | 0.33 | 0.38 | 0.08 | 0.13 | 0.13 | 0.00 | 0.02 | 0.15 | 0.11 | 0.99 | 0.83 | 0.05 | 0.12 | 0.21 | 0.33 | 0.38 | 0.09 | 0.12 | 0.13 | 0.00 | 0.03 | 0.13 | 2.00 | 0.00 | 2.00 | 0.00 | -274.00 | 0.00 | 1.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.0 | 0.00 | 0.0 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 3.00 |
| max | 456255.00 | 1.00 | 19.00 | 1.170000e+08 | 4050000.00 | 258025.50 | 4050000.00 | 0.07 | -7489.00 | 365243.00 | 0.00 | 0.00 | 91.00 | 1.0 | 1.00 | 1.0 | 1.00 | 1.00 | 1.00 | 20.00 | 3.00 | 3.00 | 23.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 0.96 | 0.85 | 0.90 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 348.00 | 34.00 | 344.00 | 24.00 | 0.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.0 | 1.00 | 1.0 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 4.00 | 9.00 | 8.00 | 27.00 | 261.00 | 25.00 |
None
display_feature_info(datasets['application_train'], 'application_train')
Description of the df continued for application_train:
---------------------------------------------------------------------------
Data type value counts:
float64 65
int64 41
object 16
dtype: int64
Return number of unique elements in the object.
NAME_CONTRACT_TYPE 2
CODE_GENDER 3
FLAG_OWN_CAR 2
FLAG_OWN_REALTY 2
NAME_TYPE_SUITE 7
NAME_INCOME_TYPE 8
NAME_EDUCATION_TYPE 5
NAME_FAMILY_STATUS 6
NAME_HOUSING_TYPE 6
OCCUPATION_TYPE 18
WEEKDAY_APPR_PROCESS_START 7
ORGANIZATION_TYPE 58
FONDKAPREMONT_MODE 4
HOUSETYPE_MODE 3
WALLSMATERIAL_MODE 7
EMERGENCYSTATE_MODE 2
dtype: int64
---------------------------------------------------------------------------
Categorical and Numerical(int + float) features of application_train.
---------------------------------------------------------------------------
{'int64': Index(['SK_ID_CURR', 'TARGET', 'CNT_CHILDREN', 'DAYS_BIRTH', 'DAYS_EMPLOYED',
'DAYS_ID_PUBLISH', 'FLAG_MOBIL', 'FLAG_EMP_PHONE', 'FLAG_WORK_PHONE',
'FLAG_CONT_MOBILE', 'FLAG_PHONE', 'FLAG_EMAIL', 'REGION_RATING_CLIENT',
'REGION_RATING_CLIENT_W_CITY', 'HOUR_APPR_PROCESS_START',
'REG_REGION_NOT_LIVE_REGION', 'REG_REGION_NOT_WORK_REGION',
'LIVE_REGION_NOT_WORK_REGION', 'REG_CITY_NOT_LIVE_CITY',
'REG_CITY_NOT_WORK_CITY', 'LIVE_CITY_NOT_WORK_CITY', 'FLAG_DOCUMENT_2',
'FLAG_DOCUMENT_3', 'FLAG_DOCUMENT_4', 'FLAG_DOCUMENT_5',
'FLAG_DOCUMENT_6', 'FLAG_DOCUMENT_7', 'FLAG_DOCUMENT_8',
'FLAG_DOCUMENT_9', 'FLAG_DOCUMENT_10', 'FLAG_DOCUMENT_11',
'FLAG_DOCUMENT_12', 'FLAG_DOCUMENT_13', 'FLAG_DOCUMENT_14',
'FLAG_DOCUMENT_15', 'FLAG_DOCUMENT_16', 'FLAG_DOCUMENT_17',
'FLAG_DOCUMENT_18', 'FLAG_DOCUMENT_19', 'FLAG_DOCUMENT_20',
'FLAG_DOCUMENT_21'],
dtype='object')}
------------------------------
{'float64': Index(['AMT_INCOME_TOTAL', 'AMT_CREDIT', 'AMT_ANNUITY', 'AMT_GOODS_PRICE',
'REGION_POPULATION_RELATIVE', 'DAYS_REGISTRATION', 'OWN_CAR_AGE',
'CNT_FAM_MEMBERS', 'EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3',
'APARTMENTS_AVG', 'BASEMENTAREA_AVG', 'YEARS_BEGINEXPLUATATION_AVG',
'YEARS_BUILD_AVG', 'COMMONAREA_AVG', 'ELEVATORS_AVG', 'ENTRANCES_AVG',
'FLOORSMAX_AVG', 'FLOORSMIN_AVG', 'LANDAREA_AVG',
'LIVINGAPARTMENTS_AVG', 'LIVINGAREA_AVG', 'NONLIVINGAPARTMENTS_AVG',
'NONLIVINGAREA_AVG', 'APARTMENTS_MODE', 'BASEMENTAREA_MODE',
'YEARS_BEGINEXPLUATATION_MODE', 'YEARS_BUILD_MODE', 'COMMONAREA_MODE',
'ELEVATORS_MODE', 'ENTRANCES_MODE', 'FLOORSMAX_MODE', 'FLOORSMIN_MODE',
'LANDAREA_MODE', 'LIVINGAPARTMENTS_MODE', 'LIVINGAREA_MODE',
'NONLIVINGAPARTMENTS_MODE', 'NONLIVINGAREA_MODE', 'APARTMENTS_MEDI',
'BASEMENTAREA_MEDI', 'YEARS_BEGINEXPLUATATION_MEDI', 'YEARS_BUILD_MEDI',
'COMMONAREA_MEDI', 'ELEVATORS_MEDI', 'ENTRANCES_MEDI', 'FLOORSMAX_MEDI',
'FLOORSMIN_MEDI', 'LANDAREA_MEDI', 'LIVINGAPARTMENTS_MEDI',
'LIVINGAREA_MEDI', 'NONLIVINGAPARTMENTS_MEDI', 'NONLIVINGAREA_MEDI',
'TOTALAREA_MODE', 'OBS_30_CNT_SOCIAL_CIRCLE',
'DEF_30_CNT_SOCIAL_CIRCLE', 'OBS_60_CNT_SOCIAL_CIRCLE',
'DEF_60_CNT_SOCIAL_CIRCLE', 'DAYS_LAST_PHONE_CHANGE',
'AMT_REQ_CREDIT_BUREAU_HOUR', 'AMT_REQ_CREDIT_BUREAU_DAY',
'AMT_REQ_CREDIT_BUREAU_WEEK', 'AMT_REQ_CREDIT_BUREAU_MON',
'AMT_REQ_CREDIT_BUREAU_QRT', 'AMT_REQ_CREDIT_BUREAU_YEAR'],
dtype='object')}
------------------------------
{'object': Index(['NAME_CONTRACT_TYPE', 'CODE_GENDER', 'FLAG_OWN_CAR', 'FLAG_OWN_REALTY',
'NAME_TYPE_SUITE', 'NAME_INCOME_TYPE', 'NAME_EDUCATION_TYPE',
'NAME_FAMILY_STATUS', 'NAME_HOUSING_TYPE', 'OCCUPATION_TYPE',
'WEEKDAY_APPR_PROCESS_START', 'ORGANIZATION_TYPE', 'FONDKAPREMONT_MODE',
'HOUSETYPE_MODE', 'WALLSMATERIAL_MODE', 'EMERGENCYSTATE_MODE'],
dtype='object')}
------------------------------
---------------------------------------------------------------------------
---------------------------------------------------------------------------
The Missing Data:
| | Percent | Train Missing Count |
|---|---|---|
| COMMONAREA_MEDI | 69.87 | 214865 |
| COMMONAREA_AVG | 69.87 | 214865 |
| COMMONAREA_MODE | 69.87 | 214865 |
| NONLIVINGAPARTMENTS_MODE | 69.43 | 213514 |
| NONLIVINGAPARTMENTS_AVG | 69.43 | 213514 |
| NONLIVINGAPARTMENTS_MEDI | 69.43 | 213514 |
| FONDKAPREMONT_MODE | 68.39 | 210295 |
| LIVINGAPARTMENTS_MODE | 68.35 | 210199 |
| LIVINGAPARTMENTS_AVG | 68.35 | 210199 |
| LIVINGAPARTMENTS_MEDI | 68.35 | 210199 |
| FLOORSMIN_AVG | 67.85 | 208642 |
| FLOORSMIN_MODE | 67.85 | 208642 |
| FLOORSMIN_MEDI | 67.85 | 208642 |
| YEARS_BUILD_MEDI | 66.50 | 204488 |
| YEARS_BUILD_MODE | 66.50 | 204488 |
| YEARS_BUILD_AVG | 66.50 | 204488 |
| OWN_CAR_AGE | 65.99 | 202929 |
| LANDAREA_MEDI | 59.38 | 182590 |
| LANDAREA_MODE | 59.38 | 182590 |
| LANDAREA_AVG | 59.38 | 182590 |
| BASEMENTAREA_MEDI | 58.52 | 179943 |
| BASEMENTAREA_AVG | 58.52 | 179943 |
| BASEMENTAREA_MODE | 58.52 | 179943 |
| EXT_SOURCE_1 | 56.38 | 173378 |
| NONLIVINGAREA_MODE | 55.18 | 169682 |
| NONLIVINGAREA_AVG | 55.18 | 169682 |
| NONLIVINGAREA_MEDI | 55.18 | 169682 |
| ELEVATORS_MEDI | 53.30 | 163891 |
| ELEVATORS_AVG | 53.30 | 163891 |
| ELEVATORS_MODE | 53.30 | 163891 |
| WALLSMATERIAL_MODE | 50.84 | 156341 |
| APARTMENTS_MEDI | 50.75 | 156061 |
| APARTMENTS_AVG | 50.75 | 156061 |
| APARTMENTS_MODE | 50.75 | 156061 |
| ENTRANCES_MEDI | 50.35 | 154828 |
| ENTRANCES_AVG | 50.35 | 154828 |
| ENTRANCES_MODE | 50.35 | 154828 |
| LIVINGAREA_AVG | 50.19 | 154350 |
| LIVINGAREA_MODE | 50.19 | 154350 |
| LIVINGAREA_MEDI | 50.19 | 154350 |
| HOUSETYPE_MODE | 50.18 | 154297 |
| FLOORSMAX_MODE | 49.76 | 153020 |
| FLOORSMAX_MEDI | 49.76 | 153020 |
| FLOORSMAX_AVG | 49.76 | 153020 |
| YEARS_BEGINEXPLUATATION_MODE | 48.78 | 150007 |
| YEARS_BEGINEXPLUATATION_MEDI | 48.78 | 150007 |
| YEARS_BEGINEXPLUATATION_AVG | 48.78 | 150007 |
| TOTALAREA_MODE | 48.27 | 148431 |
| EMERGENCYSTATE_MODE | 47.40 | 145755 |
| OCCUPATION_TYPE | 31.35 | 96391 |
| EXT_SOURCE_3 | 19.83 | 60965 |
| AMT_REQ_CREDIT_BUREAU_HOUR | 13.50 | 41519 |
| AMT_REQ_CREDIT_BUREAU_DAY | 13.50 | 41519 |
| AMT_REQ_CREDIT_BUREAU_WEEK | 13.50 | 41519 |
| AMT_REQ_CREDIT_BUREAU_MON | 13.50 | 41519 |
| AMT_REQ_CREDIT_BUREAU_QRT | 13.50 | 41519 |
| AMT_REQ_CREDIT_BUREAU_YEAR | 13.50 | 41519 |
| NAME_TYPE_SUITE | 0.42 | 1292 |
| OBS_30_CNT_SOCIAL_CIRCLE | 0.33 | 1021 |
| DEF_30_CNT_SOCIAL_CIRCLE | 0.33 | 1021 |
| OBS_60_CNT_SOCIAL_CIRCLE | 0.33 | 1021 |
| DEF_60_CNT_SOCIAL_CIRCLE | 0.33 | 1021 |
| EXT_SOURCE_2 | 0.21 | 660 |
| AMT_GOODS_PRICE | 0.09 | 278 |
---------------------------------------------------------------------------
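The descriptive statistics show that DAYS_BIRTH, DAYS_EMPLOYED, DAYS_REGISTRATION, and DAYS_ID_PUBLISH are mostly negative, which is unexpected at first glance. These columns are recorded as days relative to the application date, so events in the past appear as negative numbers.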
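As a minimal sketch of how such day counts could be turned into positive year values, the snippet below divides by -365, the same convention used for the age histogram later in this notebook. The small frame is made-up illustration data, and the `_YEARS` column names are hypothetical, not part of the dataset.

```python
import pandas as pd

# Toy stand-in for a few rows of application_train (illustrative values only).
toy = pd.DataFrame({
    "DAYS_BIRTH": [-9461, -16765, -19046],
    "DAYS_ID_PUBLISH": [-2120, -291, -2531],
})

# Negative counts mean "days before the application"; dividing by -365
# gives a positive number of years.
for col in ["DAYS_BIRTH", "DAYS_ID_PUBLISH"]:
    years_col = col.replace("DAYS_", "") + "_YEARS"  # hypothetical name
    toy[years_col] = (toy[col] / -365).round(1)

print(toy)
```

Keeping the original columns alongside the derived ones makes it easy to check the conversion before dropping anything.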
percent = (datasets["application_train"].isnull().sum()/datasets["application_train"].isnull().count()*100).sort_values(ascending = False).round(2)
sum_missing = datasets["application_train"].isna().sum().sort_values(ascending = False)
missing_application_train_data = pd.concat([percent, sum_missing], axis=1, keys=['Percent', "Train Missing Count"])
missing_application_train_data.head(20)
| | Percent | Train Missing Count |
|---|---|---|
| COMMONAREA_MEDI | 69.87 | 214865 |
| COMMONAREA_AVG | 69.87 | 214865 |
| COMMONAREA_MODE | 69.87 | 214865 |
| NONLIVINGAPARTMENTS_MODE | 69.43 | 213514 |
| NONLIVINGAPARTMENTS_AVG | 69.43 | 213514 |
| NONLIVINGAPARTMENTS_MEDI | 69.43 | 213514 |
| FONDKAPREMONT_MODE | 68.39 | 210295 |
| LIVINGAPARTMENTS_MODE | 68.35 | 210199 |
| LIVINGAPARTMENTS_AVG | 68.35 | 210199 |
| LIVINGAPARTMENTS_MEDI | 68.35 | 210199 |
| FLOORSMIN_AVG | 67.85 | 208642 |
| FLOORSMIN_MODE | 67.85 | 208642 |
| FLOORSMIN_MEDI | 67.85 | 208642 |
| YEARS_BUILD_MEDI | 66.50 | 204488 |
| YEARS_BUILD_MODE | 66.50 | 204488 |
| YEARS_BUILD_AVG | 66.50 | 204488 |
| OWN_CAR_AGE | 65.99 | 202929 |
| LANDAREA_MEDI | 59.38 | 182590 |
| LANDAREA_MODE | 59.38 | 182590 |
| LANDAREA_AVG | 59.38 | 182590 |
percent = (datasets["application_test"].isnull().sum()/datasets["application_test"].isnull().count()*100).sort_values(ascending = False).round(2)
sum_missing = datasets["application_test"].isna().sum().sort_values(ascending = False)
missing_application_test_data = pd.concat([percent, sum_missing], axis=1, keys=['Percent', "Test Missing Count"])
missing_application_test_data.head(20)
| | Percent | Test Missing Count |
|---|---|---|
| COMMONAREA_AVG | 68.72 | 33495 |
| COMMONAREA_MODE | 68.72 | 33495 |
| COMMONAREA_MEDI | 68.72 | 33495 |
| NONLIVINGAPARTMENTS_AVG | 68.41 | 33347 |
| NONLIVINGAPARTMENTS_MODE | 68.41 | 33347 |
| NONLIVINGAPARTMENTS_MEDI | 68.41 | 33347 |
| FONDKAPREMONT_MODE | 67.28 | 32797 |
| LIVINGAPARTMENTS_AVG | 67.25 | 32780 |
| LIVINGAPARTMENTS_MODE | 67.25 | 32780 |
| LIVINGAPARTMENTS_MEDI | 67.25 | 32780 |
| FLOORSMIN_MEDI | 66.61 | 32466 |
| FLOORSMIN_AVG | 66.61 | 32466 |
| FLOORSMIN_MODE | 66.61 | 32466 |
| OWN_CAR_AGE | 66.29 | 32312 |
| YEARS_BUILD_AVG | 65.28 | 31818 |
| YEARS_BUILD_MEDI | 65.28 | 31818 |
| YEARS_BUILD_MODE | 65.28 | 31818 |
| LANDAREA_MEDI | 57.96 | 28254 |
| LANDAREA_AVG | 57.96 | 28254 |
| LANDAREA_MODE | 57.96 | 28254 |
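The train and test missing-value tables are built with the same three steps, so they could be produced by one small helper and joined for a side-by-side view. The sketch below is one possible way to do that; `missing_table` is a hypothetical helper name, and the two toy frames stand in for `application_train` and `application_test`.

```python
import numpy as np
import pandas as pd

def missing_table(df, count_label):
    """Return percent and count of missing values per column, sorted descending."""
    counts = df.isna().sum()
    percent = (counts / len(df) * 100).round(2)
    table = pd.concat([percent, counts], axis=1, keys=["Percent", count_label])
    return table.sort_values("Percent", ascending=False)

# Toy frames for illustration only.
toy_train = pd.DataFrame({"a": [1.0, np.nan, 3.0, np.nan], "b": [1, 2, 3, 4]})
toy_test = pd.DataFrame({"a": [np.nan, 2.0], "b": [5, 6]})

train_missing = missing_table(toy_train, "Train Missing Count")
test_missing = missing_table(toy_test, "Test Missing Count")

# Both tables share a "Percent" column, so suffixes keep them apart after the join.
comparison = train_missing.join(test_missing, lsuffix="_train", rsuffix="_test")
print(comparison)
```

A joined table like this makes it easier to spot columns whose missing rate differs noticeably between the two splits.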
Explore the distribution of values taken on by the target variable.
datasets["application_train"].groupby(['TARGET'])['SK_ID_CURR'].count()
TARGET
0    282686
1     24825
Name: SK_ID_CURR, dtype: int64
datasets["application_train"]['TARGET'].plot.hist()
plt.show()
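The raw counts are easier to interpret as proportions. The sketch below recomputes the class shares from the two counts printed for TARGET above; the variable names are illustrative only.

```python
import pandas as pd

# Counts reported for TARGET in application_train.
target_counts = pd.Series({0: 282686, 1: 24825}, name="count")

# Share of each class in percent, and the ratio of class 0 to class 1.
share = (target_counts / target_counts.sum() * 100).round(2)
imbalance_ratio = target_counts[0] / target_counts[1]

print(share)
print(f"Class 0 to class 1 ratio: {imbalance_ratio:.1f} : 1")
```

With only about 8% of rows in class 1, the target is clearly imbalanced, so plain accuracy is likely to be a misleading metric for this problem.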
datasets["application_train"]['DAYS_EMPLOYED'].describe()
count    307511.000000
mean      63815.045904
std      141275.766519
min      -17912.000000
25%       -2760.000000
50%       -1213.000000
75%        -289.000000
max      365243.000000
Name: DAYS_EMPLOYED, dtype: float64
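# Work on a copy so the raw dataset stays untouched
df_app_train = datasets["application_train"].copy()
# Flag the anomalous value before removing it
df_app_train['DAYS_EMPLOYED_ANOM'] = df_app_train['DAYS_EMPLOYED'] == 365243
# Assign the result back instead of calling replace(..., inplace=True) on a column selection
df_app_train['DAYS_EMPLOYED'] = df_app_train['DAYS_EMPLOYED'].replace({365243: np.nan})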
plt.hist(df_app_train['DAYS_EMPLOYED'],edgecolor = 'k', bins = 25)
plt.title('DAYS_EMPLOYED'); plt.xlabel('No Of Days as per Dataset'); plt.ylabel('Count');
plt.show()
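The number of days employed is an important feature for predicting risk. However, the raw values are not all plausible: the summary above reports a maximum of 365,243 days (roughly 1,000 years), so that value is flagged as an anomaly and replaced with NaN before the histogram is plotted.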
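To see how the flag-and-replace step behaves, here is a small sketch on toy data. The toy values and the `YEARS_EMPLOYED` column are assumptions made for illustration; only the column name `DAYS_EMPLOYED`, the flag name, and the value 365243 come from the notebook.

```python
import pandas as pd

ANOMALY = 365243  # maximum reported by describe() above

# Toy rows standing in for application_train
toy = pd.DataFrame({"DAYS_EMPLOYED": [-2760, -1213, 365243, -289, 365243]})

# Keep a flag so a model can still see which rows carried the anomalous value
toy["DAYS_EMPLOYED_ANOM"] = toy["DAYS_EMPLOYED"] == ANOMALY

# mask() sets flagged rows to NaN and upcasts the column to float
toy["DAYS_EMPLOYED"] = toy["DAYS_EMPLOYED"].mask(toy["DAYS_EMPLOYED_ANOM"])

# Hypothetical convenience column: days are negative, so divide by -365
toy["YEARS_EMPLOYED"] = toy["DAYS_EMPLOYED"] / -365

print(toy)
print(f"Share of anomalous rows: {toy['DAYS_EMPLOYED_ANOM'].mean():.0%}")
```

Keeping the boolean flag alongside the cleaned column means the information carried by the anomaly is not lost when the value itself is imputed later.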
plt.hist(datasets["application_train"]['OWN_CAR_AGE'],edgecolor = 'k', bins = 25)
plt.title('OWN CAR AGE'); plt.xlabel('Car Age (years)'); plt.ylabel('Count');
plt.show()
The histogram shows a number of applications from clients whose cars are more than 60 years old.
The Application Train dataset contains most of the details about each loan request. Many of its columns have missing values, which is a concern and means these values need to be imputed. Occupation Type and Organization Type are categorical features with 18 and 58 categories respectively, which can be useful in feature engineering.
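One simple way to impute, shown here only as a sketch, is to fill numeric columns with the median and categorical columns with the most frequent value. This is an assumed baseline strategy, not the notebook's own pipeline, and the toy frame and its values are illustrative.

```python
import pandas as pd

# Toy frame with one numeric and one categorical column
toy = pd.DataFrame(
    {
        "AMT_ANNUITY": [100.0, None, 300.0, 200.0],
        "OCCUPATION_TYPE": ["Laborers", None, "Laborers", "Drivers"],
    }
)

num_cols = toy.select_dtypes(include="number").columns
cat_cols = toy.select_dtypes(exclude="number").columns

imputed = toy.copy()
# Numeric columns: fill with the column median
imputed[num_cols] = toy[num_cols].fillna(toy[num_cols].median())
# Categorical columns: fill with the most frequent value
for col in cat_cols:
    imputed[col] = toy[col].fillna(toy[col].mode().iloc[0])

# Cardinality check for the categorical columns
print(toy[cat_cols].nunique())
print(imputed)
```

The `nunique()` call is also a quick way to confirm how many categories each categorical feature has before choosing an encoding.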
plt.hist(datasets["application_train"]['DAYS_BIRTH'] / -365, edgecolor = 'k', bins = 25)
plt.title('Age of Client'); plt.xlabel('Age (years)'); plt.ylabel('Count');
plt.show()
sns.countplot(x='OCCUPATION_TYPE', data=datasets["application_train"]);
plt.title('Applicants Occupation');
plt.xticks(rotation=90);
plt.show()
sns.histplot(x='AMT_CREDIT', data=datasets["application_train"], bins=50);
plt.title('Distribution of AMT_CREDIT');
plt.show()
# Credit amounts are right-skewed, with outliers at the high end
fig, ax = plt.subplots(figsize=(10, 10))
f = sns.scatterplot(data = datasets["application_train"], x = 'AMT_INCOME_TOTAL', y = 'AMT_CREDIT', hue = 'TARGET')
f.set(xlim=(0, 1000000))
plt.show()
plt.subplots(figsize=(15, 15))
d = sns.boxplot(x = datasets["application_train"]['NAME_EDUCATION_TYPE'],
y = datasets["application_train"]['AMT_CREDIT'],
hue = datasets["application_train"]['NAME_FAMILY_STATUS'], palette="Set3")
d.set(ylim=(0, 2000000))
plt.xticks(rotation=90)
plt.show()
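# Restrict to numeric columns so corr() does not fail on object columns in recent pandas versions
correlations = datasets["application_train"].select_dtypes(include='number').corr()['TARGET'].sort_values()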
print('Most Positive Correlations:\n', correlations.tail(10))
pos = datasets["application_train"][['TARGET','DAYS_LAST_PHONE_CHANGE', 'REGION_RATING_CLIENT', 'REGION_RATING_CLIENT_W_CITY', 'DAYS_BIRTH']]
pos_corr = pos.corr()
sns.heatmap(pos_corr, annot = True, cmap='viridis')
plt.show()
Most Positive Correlations:
FLAG_DOCUMENT_3                0.044346
REG_CITY_NOT_LIVE_CITY         0.044395
FLAG_EMP_PHONE                 0.045982
REG_CITY_NOT_WORK_CITY         0.050994
DAYS_ID_PUBLISH                0.051457
DAYS_LAST_PHONE_CHANGE         0.055218
REGION_RATING_CLIENT           0.058899
REGION_RATING_CLIENT_W_CITY    0.060893
DAYS_BIRTH                     0.078239
TARGET                         1.000000
Name: TARGET, dtype: float64
print('\nMost Negative Correlations:\n', correlations.head(10))
neg = datasets["application_train"][['TARGET','EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3', 'DAYS_EMPLOYED']]
neg_corr = neg.corr()
sns.heatmap(neg_corr, annot = True, cmap='viridis')
plt.show()
Most Negative Correlations:
EXT_SOURCE_3                 -0.178919
EXT_SOURCE_2                 -0.160472
EXT_SOURCE_1                 -0.155317
DAYS_EMPLOYED                -0.044932
FLOORSMAX_AVG                -0.044003
FLOORSMAX_MEDI               -0.043768
FLOORSMAX_MODE               -0.043226
AMT_GOODS_PRICE              -0.039645
REGION_POPULATION_RELATIVE   -0.037227
ELEVATORS_AVG                -0.034199
Name: TARGET, dtype: float64
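Because the positive and negative lists are reported separately, it can help to rank features by the absolute value of their correlation with the target. The sketch below does this on a toy frame; the column names `x_pos`, `x_neg`, and `label` are hypothetical and the data is made up for illustration.

```python
import pandas as pd

# Toy frame: one feature rises with TARGET, one falls, one is non-numeric
toy = pd.DataFrame(
    {
        "TARGET": [0, 0, 1, 1, 0, 1],
        "x_pos": [1, 2, 5, 6, 2, 7],
        "x_neg": [9, 8, 2, 1, 7, 3],
        "label": list("abcdef"),
    }
)

# Correlation of every numeric feature with TARGET (TARGET itself dropped)
corr = toy.select_dtypes(include="number").corr()["TARGET"].drop("TARGET")

# Rank by strength of association, regardless of sign
ranked = corr.reindex(corr.abs().sort_values(ascending=False).index)
print(ranked)
```

Pearson correlation only captures linear association, so a low value here does not rule out a feature being useful to a non-linear model.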
The distributions of the ten features most negatively correlated with TARGET are plotted below.
var_neg_corr = correlations.head(10).index.values
numVar = var_neg_corr.shape[0]
plt.figure(figsize=(15,20))
for i,var in enumerate(var_neg_corr):
    dflt_var = datasets["application_train"].loc[datasets["application_train"]['TARGET']==1,var]
    dflt_non_var = datasets["application_train"].loc[datasets["application_train"]['TARGET']==0,var]
    plt.subplot(numVar,4,i+1)
    datasets["application_train"][var].hist()
    plt.title(var, fontsize = 10)
plt.tight_layout()
plt.show()
Density plots of the same features, split by TARGET, are plotted below.
var_neg_corr = correlations.head(10).index.values
numVar = var_neg_corr.shape[0]
plt.figure(figsize=(10,40))
for i,var in enumerate(var_neg_corr):
    dflt_var = datasets["application_train"].loc[datasets["application_train"]['TARGET']==1,var]
    dflt_non_var = datasets["application_train"].loc[datasets["application_train"]['TARGET']==0,var]
    plt.subplot(numVar,3,i+1)
    plt.subplots_adjust(wspace=2)
    sns.kdeplot(dflt_var,label='Default')
    sns.kdeplot(dflt_non_var,label='No Default')
    #plt.xlabel(var)
    plt.ylabel('Density')
    plt.title(var, fontsize = 10)
    plt.legend()  # show the Default / No Default labels
plt.tight_layout()
plt.show()
datasets.keys()
dict_keys(['application_train', 'application_test', 'bureau', 'bureau_balance', 'credit_card_balance', 'installments_payments', 'previous_application', 'POS_CASH_balance'])
len(datasets["application_train"]["SK_ID_CURR"].unique()) == datasets["application_train"].shape[0]
True
np.intersect1d(datasets["application_train"]["SK_ID_CURR"], datasets["application_test"]["SK_ID_CURR"])
array([], dtype=int64)
datasets["application_test"].shape
(48744, 121)
datasets["application_train"].shape
(307511, 122)
Most of the applicants in the Kaggle submission file have previous applications recorded in previous_application.csv: 47,800 out of 48,744 applicants have at least one previous application.
appsDF = datasets["previous_application"]
appsDF.shape
(1670214, 37)
len(np.intersect1d(datasets["previous_application"]["SK_ID_CURR"], datasets["application_test"]["SK_ID_CURR"]))
47800
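The overlap can also be reported as a coverage percentage. The sketch below repeats the `np.intersect1d` approach on toy ID arrays and then applies the same arithmetic to the two counts reported above; the toy IDs are made up for illustration.

```python
import numpy as np

# Toy IDs standing in for application_test and previous_application
test_ids = np.array([1, 2, 3, 4, 5])
prev_ids = np.array([2, 2, 3, 5, 9, 9])

# intersect1d returns the sorted unique IDs present in both arrays
shared = np.intersect1d(prev_ids, test_ids)
coverage = 100 * len(shared) / len(np.unique(test_ids))
print(f"Toy coverage: {coverage:.1f}%")

# Same arithmetic with the counts reported above
reported = round(100 * 47800 / 48744, 2)
print(f"Test applicants with previous applications: {reported}%")
```

The remaining applicants have no rows in previous_application.csv, so any features aggregated from that table will be missing for them after a left join.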
print(f"There are {appsDF.shape[0]:,} previous applications")
There are 1,670,214 previous applications
# How many previous applications are there for each SK_ID_CURR?
prevAppCounts = appsDF['SK_ID_CURR'].value_counts(dropna=False)
#prevAppCounts
len(prevAppCounts[prevAppCounts >40]) # more than 40 previous applications
101
prevAppCounts[prevAppCounts >50].plot(kind='bar')
plt.xticks(rotation=25)
plt.show()
sum(appsDF['SK_ID_CURR'].value_counts()==1)
60458
plt.hist(appsDF['SK_ID_CURR'].value_counts(), cumulative =True, bins = 100);
plt.grid()
plt.ylabel('cumulative number of IDs')
plt.xlabel('Number of previous applications per ID')
plt.title('Histogram of Number of previous applications for an ID')
plt.show()
* Low = fewer than 5 previous applications (about 58.2%)
* Medium = 5 to 39 previous applications (about 41.7%)
* High = 40 or more previous applications (about 0.03%)
apps_all = appsDF['SK_ID_CURR'].nunique()
apps_5plus = appsDF['SK_ID_CURR'].value_counts()>=5
apps_40plus = appsDF['SK_ID_CURR'].value_counts()>=40
print('Percentage with 5 or more previous apps:', np.round(100.*(sum(apps_5plus)/apps_all),5))
print('Percentage with 40 or more previous apps:', np.round(100.*(sum(apps_40plus)/apps_all),5))
Percentage with 5 or more previous apps: 41.76895
Percentage with 40 or more previous apps: 0.03453
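The thresholds used in the code above (5 and 40 previous applications) can be turned into bands with `pd.cut`. This is a sketch on toy IDs; the band labels and the toy counts are illustrative and the resulting shares are not the notebook's results.

```python
import pandas as pd

# Toy SK_ID_CURR column: ID 1 appears twice, ID 2 five times,
# ID 3 forty times, ID 4 once
toy_ids = pd.Series([1] * 2 + [2] * 5 + [3] * 40 + [4] * 1, name="SK_ID_CURR")
counts = toy_ids.value_counts()

# Right-inclusive bins: 1-4 -> Low, 5-39 -> Medium, 40+ -> High
bands = pd.cut(
    counts,
    bins=[0, 4, 39, float("inf")],
    labels=["Low", "Medium", "High"],
)

# Share of IDs in each band, in percent
share = bands.value_counts(normalize=True).mul(100).round(2)
print(share)
```

The resulting band could be used directly as a categorical feature, or the raw count could be kept as a numeric feature instead.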
display_stats(datasets['bureau'], 'bureau')
--------------------------------------------------------------------------------
bureau
--------------------------------------------------------------------------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1716428 entries, 0 to 1716427
Data columns (total 17 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 SK_ID_CURR 1716428 non-null int64
1 SK_ID_BUREAU 1716428 non-null int64
2 CREDIT_ACTIVE 1716428 non-null object
3 CREDIT_CURRENCY 1716428 non-null object
4 DAYS_CREDIT 1716428 non-null int64
5 CREDIT_DAY_OVERDUE 1716428 non-null int64
6 DAYS_CREDIT_ENDDATE 1610875 non-null float64
7 DAYS_ENDDATE_FACT 1082775 non-null float64
8 AMT_CREDIT_MAX_OVERDUE 591940 non-null float64
9 CNT_CREDIT_PROLONG 1716428 non-null int64
10 AMT_CREDIT_SUM 1716415 non-null float64
11 AMT_CREDIT_SUM_DEBT 1458759 non-null float64
12 AMT_CREDIT_SUM_LIMIT 1124648 non-null float64
13 AMT_CREDIT_SUM_OVERDUE 1716428 non-null float64
14 CREDIT_TYPE 1716428 non-null object
15 DAYS_CREDIT_UPDATE 1716428 non-null int64
16 AMT_ANNUITY 489637 non-null float64
dtypes: float64(8), int64(6), object(3)
memory usage: 222.6+ MB
None
---------------------------------------------------------------------------
Shape of the df bureau is (1716428, 17)
---------------------------------------------------------------------------
Statistical summary of bureau is :
---------------------------------------------------------------------------
Description of the df bureau:
| | SK_ID_CURR | TARGET | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | AMT_GOODS_PRICE | REGION_POPULATION_RELATIVE | DAYS_BIRTH | DAYS_EMPLOYED | DAYS_REGISTRATION | DAYS_ID_PUBLISH | OWN_CAR_AGE | FLAG_MOBIL | FLAG_EMP_PHONE | FLAG_WORK_PHONE | FLAG_CONT_MOBILE | FLAG_PHONE | FLAG_EMAIL | CNT_FAM_MEMBERS | REGION_RATING_CLIENT | REGION_RATING_CLIENT_W_CITY | HOUR_APPR_PROCESS_START | REG_REGION_NOT_LIVE_REGION | REG_REGION_NOT_WORK_REGION | LIVE_REGION_NOT_WORK_REGION | REG_CITY_NOT_LIVE_CITY | REG_CITY_NOT_WORK_CITY | LIVE_CITY_NOT_WORK_CITY | EXT_SOURCE_1 | EXT_SOURCE_2 | EXT_SOURCE_3 | APARTMENTS_AVG | BASEMENTAREA_AVG | YEARS_BEGINEXPLUATATION_AVG | YEARS_BUILD_AVG | COMMONAREA_AVG | ELEVATORS_AVG | ENTRANCES_AVG | FLOORSMAX_AVG | FLOORSMIN_AVG | LANDAREA_AVG | LIVINGAPARTMENTS_AVG | LIVINGAREA_AVG | NONLIVINGAPARTMENTS_AVG | NONLIVINGAREA_AVG | APARTMENTS_MODE | BASEMENTAREA_MODE | YEARS_BEGINEXPLUATATION_MODE | YEARS_BUILD_MODE | COMMONAREA_MODE | ELEVATORS_MODE | ENTRANCES_MODE | FLOORSMAX_MODE | FLOORSMIN_MODE | LANDAREA_MODE | LIVINGAPARTMENTS_MODE | LIVINGAREA_MODE | NONLIVINGAPARTMENTS_MODE | NONLIVINGAREA_MODE | APARTMENTS_MEDI | BASEMENTAREA_MEDI | YEARS_BEGINEXPLUATATION_MEDI | YEARS_BUILD_MEDI | COMMONAREA_MEDI | ELEVATORS_MEDI | ENTRANCES_MEDI | FLOORSMAX_MEDI | FLOORSMIN_MEDI | LANDAREA_MEDI | LIVINGAPARTMENTS_MEDI | LIVINGAREA_MEDI | NONLIVINGAPARTMENTS_MEDI | NONLIVINGAREA_MEDI | TOTALAREA_MODE | OBS_30_CNT_SOCIAL_CIRCLE | DEF_30_CNT_SOCIAL_CIRCLE | OBS_60_CNT_SOCIAL_CIRCLE | DEF_60_CNT_SOCIAL_CIRCLE | DAYS_LAST_PHONE_CHANGE | FLAG_DOCUMENT_2 | FLAG_DOCUMENT_3 | FLAG_DOCUMENT_4 | FLAG_DOCUMENT_5 | FLAG_DOCUMENT_6 | FLAG_DOCUMENT_7 | FLAG_DOCUMENT_8 | FLAG_DOCUMENT_9 | FLAG_DOCUMENT_10 | FLAG_DOCUMENT_11 | FLAG_DOCUMENT_12 | FLAG_DOCUMENT_13 | FLAG_DOCUMENT_14 | FLAG_DOCUMENT_15 | FLAG_DOCUMENT_16 | FLAG_DOCUMENT_17 | FLAG_DOCUMENT_18 | FLAG_DOCUMENT_19 | FLAG_DOCUMENT_20 | FLAG_DOCUMENT_21 | AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 307511.00 | 307511.00 | 307511.00 | 3.075110e+05 | 307511.00 | 307499.00 | 307233.00 | 307511.00 | 307511.00 | 307511.00 | 307511.00 | 307511.00 | 104582.00 | 307511.0 | 307511.00 | 307511.0 | 307511.00 | 307511.00 | 307511.00 | 307509.00 | 307511.00 | 307511.00 | 307511.00 | 307511.00 | 307511.00 | 307511.00 | 307511.00 | 307511.00 | 307511.00 | 134133.00 | 306851.00 | 246546.00 | 151450.00 | 127568.00 | 157504.00 | 103023.00 | 92646.00 | 143620.00 | 152683.00 | 154491.00 | 98869.00 | 124921.00 | 97312.00 | 153161.00 | 93997.00 | 137829.00 | 151450.00 | 127568.00 | 157504.00 | 103023.00 | 92646.00 | 143620.00 | 152683.00 | 154491.00 | 98869.00 | 124921.00 | 97312.00 | 153161.00 | 93997.00 | 137829.00 | 151450.00 | 127568.00 | 157504.00 | 103023.00 | 92646.00 | 143620.00 | 152683.00 | 154491.00 | 98869.00 | 124921.00 | 97312.00 | 153161.00 | 93997.00 | 137829.00 | 159080.00 | 306490.00 | 306490.00 | 306490.00 | 306490.00 | 307510.00 | 307511.00 | 307511.00 | 307511.00 | 307511.00 | 307511.00 | 307511.00 | 307511.00 | 307511.00 | 307511.0 | 307511.00 | 307511.0 | 307511.00 | 307511.00 | 307511.00 | 307511.00 | 307511.00 | 307511.00 | 307511.00 | 307511.00 | 307511.00 | 265992.00 | 265992.00 | 265992.00 | 265992.00 | 265992.00 | 265992.00 |
| mean | 278180.52 | 0.08 | 0.42 | 1.687979e+05 | 599026.00 | 27108.57 | 538396.21 | 0.02 | -16037.00 | 63815.05 | -4986.12 | -2994.20 | 12.06 | 1.0 | 0.82 | 0.2 | 1.00 | 0.28 | 0.06 | 2.15 | 2.05 | 2.03 | 12.06 | 0.02 | 0.05 | 0.04 | 0.08 | 0.23 | 0.18 | 0.50 | 0.51 | 0.51 | 0.12 | 0.09 | 0.98 | 0.75 | 0.04 | 0.08 | 0.15 | 0.23 | 0.23 | 0.07 | 0.10 | 0.11 | 0.01 | 0.03 | 0.11 | 0.09 | 0.98 | 0.76 | 0.04 | 0.07 | 0.15 | 0.22 | 0.23 | 0.06 | 0.11 | 0.11 | 0.01 | 0.03 | 0.12 | 0.09 | 0.98 | 0.76 | 0.04 | 0.08 | 0.15 | 0.23 | 0.23 | 0.07 | 0.10 | 0.11 | 0.01 | 0.03 | 0.10 | 1.42 | 0.14 | 1.41 | 0.10 | -962.86 | 0.00 | 0.71 | 0.00 | 0.02 | 0.09 | 0.00 | 0.08 | 0.00 | 0.0 | 0.00 | 0.0 | 0.00 | 0.00 | 0.00 | 0.01 | 0.00 | 0.01 | 0.00 | 0.00 | 0.00 | 0.01 | 0.01 | 0.03 | 0.27 | 0.27 | 1.90 |
| std | 102790.18 | 0.27 | 0.72 | 2.371231e+05 | 402490.78 | 14493.74 | 369446.46 | 0.01 | 4363.99 | 141275.77 | 3522.89 | 1509.45 | 11.94 | 0.0 | 0.38 | 0.4 | 0.04 | 0.45 | 0.23 | 0.91 | 0.51 | 0.50 | 3.27 | 0.12 | 0.22 | 0.20 | 0.27 | 0.42 | 0.38 | 0.21 | 0.19 | 0.19 | 0.11 | 0.08 | 0.06 | 0.11 | 0.08 | 0.13 | 0.10 | 0.14 | 0.16 | 0.08 | 0.09 | 0.11 | 0.05 | 0.07 | 0.11 | 0.08 | 0.06 | 0.11 | 0.07 | 0.13 | 0.10 | 0.14 | 0.16 | 0.08 | 0.10 | 0.11 | 0.05 | 0.07 | 0.11 | 0.08 | 0.06 | 0.11 | 0.08 | 0.13 | 0.10 | 0.15 | 0.16 | 0.08 | 0.09 | 0.11 | 0.05 | 0.07 | 0.11 | 2.40 | 0.45 | 2.38 | 0.36 | 826.81 | 0.01 | 0.45 | 0.01 | 0.12 | 0.28 | 0.01 | 0.27 | 0.06 | 0.0 | 0.06 | 0.0 | 0.06 | 0.05 | 0.03 | 0.10 | 0.02 | 0.09 | 0.02 | 0.02 | 0.02 | 0.08 | 0.11 | 0.20 | 0.92 | 0.79 | 1.87 |
| min | 100002.00 | 0.00 | 0.00 | 2.565000e+04 | 45000.00 | 1615.50 | 40500.00 | 0.00 | -25229.00 | -17912.00 | -24672.00 | -7197.00 | 0.00 | 0.0 | 0.00 | 0.0 | 0.00 | 0.00 | 0.00 | 1.00 | 1.00 | 1.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.01 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | -4292.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.0 | 0.00 | 0.0 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| 25% | 189145.50 | 0.00 | 0.00 | 1.125000e+05 | 270000.00 | 16524.00 | 238500.00 | 0.01 | -19682.00 | -2760.00 | -7479.50 | -4299.00 | 5.00 | 1.0 | 1.00 | 0.0 | 1.00 | 0.00 | 0.00 | 2.00 | 2.00 | 2.00 | 10.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.33 | 0.39 | 0.37 | 0.06 | 0.04 | 0.98 | 0.69 | 0.01 | 0.00 | 0.07 | 0.17 | 0.08 | 0.02 | 0.05 | 0.05 | 0.00 | 0.00 | 0.05 | 0.04 | 0.98 | 0.70 | 0.01 | 0.00 | 0.07 | 0.17 | 0.08 | 0.02 | 0.05 | 0.04 | 0.00 | 0.00 | 0.06 | 0.04 | 0.98 | 0.69 | 0.01 | 0.00 | 0.07 | 0.17 | 0.08 | 0.02 | 0.05 | 0.05 | 0.00 | 0.00 | 0.04 | 0.00 | 0.00 | 0.00 | 0.00 | -1570.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.0 | 0.00 | 0.0 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| 50% | 278202.00 | 0.00 | 0.00 | 1.471500e+05 | 513531.00 | 24903.00 | 450000.00 | 0.02 | -15750.00 | -1213.00 | -4504.00 | -3254.00 | 9.00 | 1.0 | 1.00 | 0.0 | 1.00 | 0.00 | 0.00 | 2.00 | 2.00 | 2.00 | 12.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.51 | 0.57 | 0.54 | 0.09 | 0.08 | 0.98 | 0.76 | 0.02 | 0.00 | 0.14 | 0.17 | 0.21 | 0.05 | 0.08 | 0.07 | 0.00 | 0.00 | 0.08 | 0.07 | 0.98 | 0.76 | 0.02 | 0.00 | 0.14 | 0.17 | 0.21 | 0.05 | 0.08 | 0.07 | 0.00 | 0.00 | 0.09 | 0.08 | 0.98 | 0.76 | 0.02 | 0.00 | 0.14 | 0.17 | 0.21 | 0.05 | 0.08 | 0.07 | 0.00 | 0.00 | 0.07 | 0.00 | 0.00 | 0.00 | 0.00 | -757.00 | 0.00 | 1.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.0 | 0.00 | 0.0 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 |
| 75% | 367142.50 | 0.00 | 1.00 | 2.025000e+05 | 808650.00 | 34596.00 | 679500.00 | 0.03 | -12413.00 | -289.00 | -2010.00 | -1720.00 | 15.00 | 1.0 | 1.00 | 0.0 | 1.00 | 1.00 | 0.00 | 3.00 | 2.00 | 2.00 | 14.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.68 | 0.66 | 0.67 | 0.15 | 0.11 | 0.99 | 0.82 | 0.05 | 0.12 | 0.21 | 0.33 | 0.38 | 0.09 | 0.12 | 0.13 | 0.00 | 0.03 | 0.14 | 0.11 | 0.99 | 0.82 | 0.05 | 0.12 | 0.21 | 0.33 | 0.38 | 0.08 | 0.13 | 0.13 | 0.00 | 0.02 | 0.15 | 0.11 | 0.99 | 0.83 | 0.05 | 0.12 | 0.21 | 0.33 | 0.38 | 0.09 | 0.12 | 0.13 | 0.00 | 0.03 | 0.13 | 2.00 | 0.00 | 2.00 | 0.00 | -274.00 | 0.00 | 1.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.0 | 0.00 | 0.0 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 3.00 |
| max | 456255.00 | 1.00 | 19.00 | 1.170000e+08 | 4050000.00 | 258025.50 | 4050000.00 | 0.07 | -7489.00 | 365243.00 | 0.00 | 0.00 | 91.00 | 1.0 | 1.00 | 1.0 | 1.00 | 1.00 | 1.00 | 20.00 | 3.00 | 3.00 | 23.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 0.96 | 0.85 | 0.90 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 348.00 | 34.00 | 344.00 | 24.00 | 0.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.0 | 1.00 | 1.0 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 4.00 | 9.00 | 8.00 | 27.00 | 261.00 | 25.00 |
None
display_feature_info(datasets['bureau'], 'bureau')
Description of the df continued for bureau:
---------------------------------------------------------------------------
Data type value counts:
float64 8
int64 6
object 3
dtype: int64
Return number of unique elements in the object.
CREDIT_ACTIVE 4
CREDIT_CURRENCY 4
CREDIT_TYPE 15
dtype: int64
---------------------------------------------------------------------------
Categorical and Numerical(int + float) features of bureau.
---------------------------------------------------------------------------
{'int64': Index(['SK_ID_CURR', 'SK_ID_BUREAU', 'DAYS_CREDIT', 'CREDIT_DAY_OVERDUE',
'CNT_CREDIT_PROLONG', 'DAYS_CREDIT_UPDATE'],
dtype='object')}
------------------------------
{'float64': Index(['DAYS_CREDIT_ENDDATE', 'DAYS_ENDDATE_FACT', 'AMT_CREDIT_MAX_OVERDUE',
'AMT_CREDIT_SUM', 'AMT_CREDIT_SUM_DEBT', 'AMT_CREDIT_SUM_LIMIT',
'AMT_CREDIT_SUM_OVERDUE', 'AMT_ANNUITY'],
dtype='object')}
------------------------------
{'object': Index(['CREDIT_ACTIVE', 'CREDIT_CURRENCY', 'CREDIT_TYPE'], dtype='object')}
------------------------------
---------------------------------------------------------------------------
---------------------------------------------------------------------------
The Missing Data:
| | Percent | Train Missing Count |
|---|---|---|
| AMT_ANNUITY | 71.47 | 1226791 |
| AMT_CREDIT_MAX_OVERDUE | 65.51 | 1124488 |
| DAYS_ENDDATE_FACT | 36.92 | 633653 |
| AMT_CREDIT_SUM_LIMIT | 34.48 | 591780 |
| AMT_CREDIT_SUM_DEBT | 15.01 | 257669 |
| DAYS_CREDIT_ENDDATE | 6.15 | 105553 |
---------------------------------------------------------------------------
display_stats(datasets['bureau_balance'], 'bureau_balance')
--------------------------------------------------------------------------------
bureau_balance
--------------------------------------------------------------------------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 27299925 entries, 0 to 27299924
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 SK_ID_BUREAU 27299925 non-null int64
1 MONTHS_BALANCE 27299925 non-null int64
2 STATUS 27299925 non-null object
dtypes: int64(2), object(1)
memory usage: 624.8+ MB
None
---------------------------------------------------------------------------
Shape of the df bureau_balance is (27299925, 3)
---------------------------------------------------------------------------
Statistical summary of bureau_balance is :
---------------------------------------------------------------------------
Description of the df bureau_balance:
| | SK_ID_CURR | TARGET | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | AMT_GOODS_PRICE | REGION_POPULATION_RELATIVE | DAYS_BIRTH | DAYS_EMPLOYED | DAYS_REGISTRATION | DAYS_ID_PUBLISH | OWN_CAR_AGE | FLAG_MOBIL | FLAG_EMP_PHONE | FLAG_WORK_PHONE | FLAG_CONT_MOBILE | FLAG_PHONE | FLAG_EMAIL | CNT_FAM_MEMBERS | REGION_RATING_CLIENT | REGION_RATING_CLIENT_W_CITY | HOUR_APPR_PROCESS_START | REG_REGION_NOT_LIVE_REGION | REG_REGION_NOT_WORK_REGION | LIVE_REGION_NOT_WORK_REGION | REG_CITY_NOT_LIVE_CITY | REG_CITY_NOT_WORK_CITY | LIVE_CITY_NOT_WORK_CITY | EXT_SOURCE_1 | EXT_SOURCE_2 | EXT_SOURCE_3 | APARTMENTS_AVG | BASEMENTAREA_AVG | YEARS_BEGINEXPLUATATION_AVG | YEARS_BUILD_AVG | COMMONAREA_AVG | ELEVATORS_AVG | ENTRANCES_AVG | FLOORSMAX_AVG | FLOORSMIN_AVG | LANDAREA_AVG | LIVINGAPARTMENTS_AVG | LIVINGAREA_AVG | NONLIVINGAPARTMENTS_AVG | NONLIVINGAREA_AVG | APARTMENTS_MODE | BASEMENTAREA_MODE | YEARS_BEGINEXPLUATATION_MODE | YEARS_BUILD_MODE | COMMONAREA_MODE | ELEVATORS_MODE | ENTRANCES_MODE | FLOORSMAX_MODE | FLOORSMIN_MODE | LANDAREA_MODE | LIVINGAPARTMENTS_MODE | LIVINGAREA_MODE | NONLIVINGAPARTMENTS_MODE | NONLIVINGAREA_MODE | APARTMENTS_MEDI | BASEMENTAREA_MEDI | YEARS_BEGINEXPLUATATION_MEDI | YEARS_BUILD_MEDI | COMMONAREA_MEDI | ELEVATORS_MEDI | ENTRANCES_MEDI | FLOORSMAX_MEDI | FLOORSMIN_MEDI | LANDAREA_MEDI | LIVINGAPARTMENTS_MEDI | LIVINGAREA_MEDI | NONLIVINGAPARTMENTS_MEDI | NONLIVINGAREA_MEDI | TOTALAREA_MODE | OBS_30_CNT_SOCIAL_CIRCLE | DEF_30_CNT_SOCIAL_CIRCLE | OBS_60_CNT_SOCIAL_CIRCLE | DEF_60_CNT_SOCIAL_CIRCLE | DAYS_LAST_PHONE_CHANGE | FLAG_DOCUMENT_2 | FLAG_DOCUMENT_3 | FLAG_DOCUMENT_4 | FLAG_DOCUMENT_5 | FLAG_DOCUMENT_6 | FLAG_DOCUMENT_7 | FLAG_DOCUMENT_8 | FLAG_DOCUMENT_9 | FLAG_DOCUMENT_10 | FLAG_DOCUMENT_11 | FLAG_DOCUMENT_12 | FLAG_DOCUMENT_13 | FLAG_DOCUMENT_14 | FLAG_DOCUMENT_15 | FLAG_DOCUMENT_16 | FLAG_DOCUMENT_17 | FLAG_DOCUMENT_18 | FLAG_DOCUMENT_19 | FLAG_DOCUMENT_20 | FLAG_DOCUMENT_21 | AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 307511.00 | 307511.00 | 307511.00 | 3.075110e+05 | 307511.00 | 307499.00 | 307233.00 | 307511.00 | 307511.00 | 307511.00 | 307511.00 | 307511.00 | 104582.00 | 307511.0 | 307511.00 | 307511.0 | 307511.00 | 307511.00 | 307511.00 | 307509.00 | 307511.00 | 307511.00 | 307511.00 | 307511.00 | 307511.00 | 307511.00 | 307511.00 | 307511.00 | 307511.00 | 134133.00 | 306851.00 | 246546.00 | 151450.00 | 127568.00 | 157504.00 | 103023.00 | 92646.00 | 143620.00 | 152683.00 | 154491.00 | 98869.00 | 124921.00 | 97312.00 | 153161.00 | 93997.00 | 137829.00 | 151450.00 | 127568.00 | 157504.00 | 103023.00 | 92646.00 | 143620.00 | 152683.00 | 154491.00 | 98869.00 | 124921.00 | 97312.00 | 153161.00 | 93997.00 | 137829.00 | 151450.00 | 127568.00 | 157504.00 | 103023.00 | 92646.00 | 143620.00 | 152683.00 | 154491.00 | 98869.00 | 124921.00 | 97312.00 | 153161.00 | 93997.00 | 137829.00 | 159080.00 | 306490.00 | 306490.00 | 306490.00 | 306490.00 | 307510.00 | 307511.00 | 307511.00 | 307511.00 | 307511.00 | 307511.00 | 307511.00 | 307511.00 | 307511.00 | 307511.0 | 307511.00 | 307511.0 | 307511.00 | 307511.00 | 307511.00 | 307511.00 | 307511.00 | 307511.00 | 307511.00 | 307511.00 | 307511.00 | 265992.00 | 265992.00 | 265992.00 | 265992.00 | 265992.00 | 265992.00 |
| mean | 278180.52 | 0.08 | 0.42 | 1.687979e+05 | 599026.00 | 27108.57 | 538396.21 | 0.02 | -16037.00 | 63815.05 | -4986.12 | -2994.20 | 12.06 | 1.0 | 0.82 | 0.2 | 1.00 | 0.28 | 0.06 | 2.15 | 2.05 | 2.03 | 12.06 | 0.02 | 0.05 | 0.04 | 0.08 | 0.23 | 0.18 | 0.50 | 0.51 | 0.51 | 0.12 | 0.09 | 0.98 | 0.75 | 0.04 | 0.08 | 0.15 | 0.23 | 0.23 | 0.07 | 0.10 | 0.11 | 0.01 | 0.03 | 0.11 | 0.09 | 0.98 | 0.76 | 0.04 | 0.07 | 0.15 | 0.22 | 0.23 | 0.06 | 0.11 | 0.11 | 0.01 | 0.03 | 0.12 | 0.09 | 0.98 | 0.76 | 0.04 | 0.08 | 0.15 | 0.23 | 0.23 | 0.07 | 0.10 | 0.11 | 0.01 | 0.03 | 0.10 | 1.42 | 0.14 | 1.41 | 0.10 | -962.86 | 0.00 | 0.71 | 0.00 | 0.02 | 0.09 | 0.00 | 0.08 | 0.00 | 0.0 | 0.00 | 0.0 | 0.00 | 0.00 | 0.00 | 0.01 | 0.00 | 0.01 | 0.00 | 0.00 | 0.00 | 0.01 | 0.01 | 0.03 | 0.27 | 0.27 | 1.90 |
| std | 102790.18 | 0.27 | 0.72 | 2.371231e+05 | 402490.78 | 14493.74 | 369446.46 | 0.01 | 4363.99 | 141275.77 | 3522.89 | 1509.45 | 11.94 | 0.0 | 0.38 | 0.4 | 0.04 | 0.45 | 0.23 | 0.91 | 0.51 | 0.50 | 3.27 | 0.12 | 0.22 | 0.20 | 0.27 | 0.42 | 0.38 | 0.21 | 0.19 | 0.19 | 0.11 | 0.08 | 0.06 | 0.11 | 0.08 | 0.13 | 0.10 | 0.14 | 0.16 | 0.08 | 0.09 | 0.11 | 0.05 | 0.07 | 0.11 | 0.08 | 0.06 | 0.11 | 0.07 | 0.13 | 0.10 | 0.14 | 0.16 | 0.08 | 0.10 | 0.11 | 0.05 | 0.07 | 0.11 | 0.08 | 0.06 | 0.11 | 0.08 | 0.13 | 0.10 | 0.15 | 0.16 | 0.08 | 0.09 | 0.11 | 0.05 | 0.07 | 0.11 | 2.40 | 0.45 | 2.38 | 0.36 | 826.81 | 0.01 | 0.45 | 0.01 | 0.12 | 0.28 | 0.01 | 0.27 | 0.06 | 0.0 | 0.06 | 0.0 | 0.06 | 0.05 | 0.03 | 0.10 | 0.02 | 0.09 | 0.02 | 0.02 | 0.02 | 0.08 | 0.11 | 0.20 | 0.92 | 0.79 | 1.87 |
| min | 100002.00 | 0.00 | 0.00 | 2.565000e+04 | 45000.00 | 1615.50 | 40500.00 | 0.00 | -25229.00 | -17912.00 | -24672.00 | -7197.00 | 0.00 | 0.0 | 0.00 | 0.0 | 0.00 | 0.00 | 0.00 | 1.00 | 1.00 | 1.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.01 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | -4292.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.0 | 0.00 | 0.0 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| 25% | 189145.50 | 0.00 | 0.00 | 1.125000e+05 | 270000.00 | 16524.00 | 238500.00 | 0.01 | -19682.00 | -2760.00 | -7479.50 | -4299.00 | 5.00 | 1.0 | 1.00 | 0.0 | 1.00 | 0.00 | 0.00 | 2.00 | 2.00 | 2.00 | 10.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.33 | 0.39 | 0.37 | 0.06 | 0.04 | 0.98 | 0.69 | 0.01 | 0.00 | 0.07 | 0.17 | 0.08 | 0.02 | 0.05 | 0.05 | 0.00 | 0.00 | 0.05 | 0.04 | 0.98 | 0.70 | 0.01 | 0.00 | 0.07 | 0.17 | 0.08 | 0.02 | 0.05 | 0.04 | 0.00 | 0.00 | 0.06 | 0.04 | 0.98 | 0.69 | 0.01 | 0.00 | 0.07 | 0.17 | 0.08 | 0.02 | 0.05 | 0.05 | 0.00 | 0.00 | 0.04 | 0.00 | 0.00 | 0.00 | 0.00 | -1570.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.0 | 0.00 | 0.0 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| 50% | 278202.00 | 0.00 | 0.00 | 1.471500e+05 | 513531.00 | 24903.00 | 450000.00 | 0.02 | -15750.00 | -1213.00 | -4504.00 | -3254.00 | 9.00 | 1.0 | 1.00 | 0.0 | 1.00 | 0.00 | 0.00 | 2.00 | 2.00 | 2.00 | 12.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.51 | 0.57 | 0.54 | 0.09 | 0.08 | 0.98 | 0.76 | 0.02 | 0.00 | 0.14 | 0.17 | 0.21 | 0.05 | 0.08 | 0.07 | 0.00 | 0.00 | 0.08 | 0.07 | 0.98 | 0.76 | 0.02 | 0.00 | 0.14 | 0.17 | 0.21 | 0.05 | 0.08 | 0.07 | 0.00 | 0.00 | 0.09 | 0.08 | 0.98 | 0.76 | 0.02 | 0.00 | 0.14 | 0.17 | 0.21 | 0.05 | 0.08 | 0.07 | 0.00 | 0.00 | 0.07 | 0.00 | 0.00 | 0.00 | 0.00 | -757.00 | 0.00 | 1.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.0 | 0.00 | 0.0 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 |
| 75% | 367142.50 | 0.00 | 1.00 | 2.025000e+05 | 808650.00 | 34596.00 | 679500.00 | 0.03 | -12413.00 | -289.00 | -2010.00 | -1720.00 | 15.00 | 1.0 | 1.00 | 0.0 | 1.00 | 1.00 | 0.00 | 3.00 | 2.00 | 2.00 | 14.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.68 | 0.66 | 0.67 | 0.15 | 0.11 | 0.99 | 0.82 | 0.05 | 0.12 | 0.21 | 0.33 | 0.38 | 0.09 | 0.12 | 0.13 | 0.00 | 0.03 | 0.14 | 0.11 | 0.99 | 0.82 | 0.05 | 0.12 | 0.21 | 0.33 | 0.38 | 0.08 | 0.13 | 0.13 | 0.00 | 0.02 | 0.15 | 0.11 | 0.99 | 0.83 | 0.05 | 0.12 | 0.21 | 0.33 | 0.38 | 0.09 | 0.12 | 0.13 | 0.00 | 0.03 | 0.13 | 2.00 | 0.00 | 2.00 | 0.00 | -274.00 | 0.00 | 1.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.0 | 0.00 | 0.0 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 3.00 |
| max | 456255.00 | 1.00 | 19.00 | 1.170000e+08 | 4050000.00 | 258025.50 | 4050000.00 | 0.07 | -7489.00 | 365243.00 | 0.00 | 0.00 | 91.00 | 1.0 | 1.00 | 1.0 | 1.00 | 1.00 | 1.00 | 20.00 | 3.00 | 3.00 | 23.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 0.96 | 0.85 | 0.90 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 348.00 | 34.00 | 344.00 | 24.00 | 0.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.0 | 1.00 | 1.0 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 4.00 | 9.00 | 8.00 | 27.00 | 261.00 | 25.00 |
None
display_feature_info(datasets['bureau_balance'], 'bureau_balance')
Description of the df continued for bureau_balance:
---------------------------------------------------------------------------
Data type value counts:
int64 2
object 1
dtype: int64
Return number of unique elements in the object.
STATUS 8
dtype: int64
---------------------------------------------------------------------------
Categorical and Numerical(int + float) features of bureau_balance.
---------------------------------------------------------------------------
{'int64': Index(['SK_ID_BUREAU', 'MONTHS_BALANCE'], dtype='object')}
------------------------------
{'object': Index(['STATUS'], dtype='object')}
------------------------------
---------------------------------------------------------------------------
---------------------------------------------------------------------------
The Missing Data:
No missing Data
As we can see, Bureau Balance does not have any missing values, while Bureau has missing data in six columns, as listed in the table above. Bureau and Bureau Balance can be used to provide aggregate features.
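As a rough illustration of what such aggregate features could look like, the sketch below rolls bureau_balance up to one row per SK_ID_BUREAU, joins it onto bureau, and then aggregates to one row per SK_ID_CURR. The column names come from the tables above; the toy rows and the output feature names (`bb_months`, `bureau_loan_count`, and so on) are assumptions made for the example.

```python
import pandas as pd

# Toy rows standing in for datasets['bureau'] and datasets['bureau_balance']
bureau = pd.DataFrame(
    {
        "SK_ID_CURR": [1, 1, 2],
        "SK_ID_BUREAU": [10, 11, 12],
        "AMT_CREDIT_SUM": [100.0, 300.0, 50.0],
        "CREDIT_ACTIVE": ["Active", "Closed", "Active"],
    }
)
bureau_balance = pd.DataFrame(
    {
        "SK_ID_BUREAU": [10, 10, 11, 12],
        "MONTHS_BALANCE": [0, -1, 0, 0],
        "STATUS": ["0", "1", "C", "0"],
    }
)

# Step 1: one row per bureau credit, counting its monthly balance records
bb_agg = (
    bureau_balance.groupby("SK_ID_BUREAU")
    .agg(bb_months=("MONTHS_BALANCE", "count"))
    .reset_index()
)

# Step 2: attach the monthly summary to each bureau credit
bureau_full = bureau.merge(bb_agg, on="SK_ID_BUREAU", how="left")

# Step 3: one row per applicant, ready to join on SK_ID_CURR
features = (
    bureau_full.groupby("SK_ID_CURR")
    .agg(
        bureau_loan_count=("SK_ID_BUREAU", "count"),
        bureau_credit_sum_mean=("AMT_CREDIT_SUM", "mean"),
        bureau_active_count=("CREDIT_ACTIVE", lambda s: (s == "Active").sum()),
        bb_months_total=("bb_months", "sum"),
    )
    .reset_index()
)
print(features)
```

The two-step roll-up is needed because bureau_balance is keyed by SK_ID_BUREAU rather than SK_ID_CURR, so it has to pass through bureau before it can be joined to the application tables.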
display_stats(datasets['credit_card_balance'], 'credit_card_balance')
--------------------------------------------------------------------------------
credit_card_balance
--------------------------------------------------------------------------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3840312 entries, 0 to 3840311
Data columns (total 23 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 SK_ID_PREV 3840312 non-null int64
1 SK_ID_CURR 3840312 non-null int64
2 MONTHS_BALANCE 3840312 non-null int64
3 AMT_BALANCE 3840312 non-null float64
4 AMT_CREDIT_LIMIT_ACTUAL 3840312 non-null int64
5 AMT_DRAWINGS_ATM_CURRENT 3090496 non-null float64
6 AMT_DRAWINGS_CURRENT 3840312 non-null float64
7 AMT_DRAWINGS_OTHER_CURRENT 3090496 non-null float64
8 AMT_DRAWINGS_POS_CURRENT 3090496 non-null float64
9 AMT_INST_MIN_REGULARITY 3535076 non-null float64
10 AMT_PAYMENT_CURRENT 3072324 non-null float64
11 AMT_PAYMENT_TOTAL_CURRENT 3840312 non-null float64
12 AMT_RECEIVABLE_PRINCIPAL 3840312 non-null float64
13 AMT_RECIVABLE 3840312 non-null float64
14 AMT_TOTAL_RECEIVABLE 3840312 non-null float64
15 CNT_DRAWINGS_ATM_CURRENT 3090496 non-null float64
16 CNT_DRAWINGS_CURRENT 3840312 non-null int64
17 CNT_DRAWINGS_OTHER_CURRENT 3090496 non-null float64
18 CNT_DRAWINGS_POS_CURRENT 3090496 non-null float64
19 CNT_INSTALMENT_MATURE_CUM 3535076 non-null float64
20 NAME_CONTRACT_STATUS 3840312 non-null object
21 SK_DPD 3840312 non-null int64
22 SK_DPD_DEF 3840312 non-null int64
dtypes: float64(15), int64(7), object(1)
memory usage: 673.9+ MB
None
---------------------------------------------------------------------------
Shape of the df credit_card_balance is (3840312, 23)
---------------------------------------------------------------------------
Statistical summary of credit_card_balance is :
---------------------------------------------------------------------------
Description of the df credit_card_balance:
| | SK_ID_CURR | TARGET | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | AMT_GOODS_PRICE | REGION_POPULATION_RELATIVE | DAYS_BIRTH | DAYS_EMPLOYED | DAYS_REGISTRATION | DAYS_ID_PUBLISH | OWN_CAR_AGE | FLAG_MOBIL | FLAG_EMP_PHONE | FLAG_WORK_PHONE | FLAG_CONT_MOBILE | FLAG_PHONE | FLAG_EMAIL | CNT_FAM_MEMBERS | REGION_RATING_CLIENT | REGION_RATING_CLIENT_W_CITY | HOUR_APPR_PROCESS_START | REG_REGION_NOT_LIVE_REGION | REG_REGION_NOT_WORK_REGION | LIVE_REGION_NOT_WORK_REGION | REG_CITY_NOT_LIVE_CITY | REG_CITY_NOT_WORK_CITY | LIVE_CITY_NOT_WORK_CITY | EXT_SOURCE_1 | EXT_SOURCE_2 | EXT_SOURCE_3 | APARTMENTS_AVG | BASEMENTAREA_AVG | YEARS_BEGINEXPLUATATION_AVG | YEARS_BUILD_AVG | COMMONAREA_AVG | ELEVATORS_AVG | ENTRANCES_AVG | FLOORSMAX_AVG | FLOORSMIN_AVG | LANDAREA_AVG | LIVINGAPARTMENTS_AVG | LIVINGAREA_AVG | NONLIVINGAPARTMENTS_AVG | NONLIVINGAREA_AVG | APARTMENTS_MODE | BASEMENTAREA_MODE | YEARS_BEGINEXPLUATATION_MODE | YEARS_BUILD_MODE | COMMONAREA_MODE | ELEVATORS_MODE | ENTRANCES_MODE | FLOORSMAX_MODE | FLOORSMIN_MODE | LANDAREA_MODE | LIVINGAPARTMENTS_MODE | LIVINGAREA_MODE | NONLIVINGAPARTMENTS_MODE | NONLIVINGAREA_MODE | APARTMENTS_MEDI | BASEMENTAREA_MEDI | YEARS_BEGINEXPLUATATION_MEDI | YEARS_BUILD_MEDI | COMMONAREA_MEDI | ELEVATORS_MEDI | ENTRANCES_MEDI | FLOORSMAX_MEDI | FLOORSMIN_MEDI | LANDAREA_MEDI | LIVINGAPARTMENTS_MEDI | LIVINGAREA_MEDI | NONLIVINGAPARTMENTS_MEDI | NONLIVINGAREA_MEDI | TOTALAREA_MODE | OBS_30_CNT_SOCIAL_CIRCLE | DEF_30_CNT_SOCIAL_CIRCLE | OBS_60_CNT_SOCIAL_CIRCLE | DEF_60_CNT_SOCIAL_CIRCLE | DAYS_LAST_PHONE_CHANGE | FLAG_DOCUMENT_2 | FLAG_DOCUMENT_3 | FLAG_DOCUMENT_4 | FLAG_DOCUMENT_5 | FLAG_DOCUMENT_6 | FLAG_DOCUMENT_7 | FLAG_DOCUMENT_8 | FLAG_DOCUMENT_9 | FLAG_DOCUMENT_10 | FLAG_DOCUMENT_11 | FLAG_DOCUMENT_12 | FLAG_DOCUMENT_13 | FLAG_DOCUMENT_14 | FLAG_DOCUMENT_15 | FLAG_DOCUMENT_16 | FLAG_DOCUMENT_17 | FLAG_DOCUMENT_18 | FLAG_DOCUMENT_19 | FLAG_DOCUMENT_20 | FLAG_DOCUMENT_21 | AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 307511.00 | 307511.00 | 307511.00 | 3.075110e+05 | 307511.00 | 307499.00 | 307233.00 | 307511.00 | 307511.00 | 307511.00 | 307511.00 | 307511.00 | 104582.00 | 307511.0 | 307511.00 | 307511.0 | 307511.00 | 307511.00 | 307511.00 | 307509.00 | 307511.00 | 307511.00 | 307511.00 | 307511.00 | 307511.00 | 307511.00 | 307511.00 | 307511.00 | 307511.00 | 134133.00 | 306851.00 | 246546.00 | 151450.00 | 127568.00 | 157504.00 | 103023.00 | 92646.00 | 143620.00 | 152683.00 | 154491.00 | 98869.00 | 124921.00 | 97312.00 | 153161.00 | 93997.00 | 137829.00 | 151450.00 | 127568.00 | 157504.00 | 103023.00 | 92646.00 | 143620.00 | 152683.00 | 154491.00 | 98869.00 | 124921.00 | 97312.00 | 153161.00 | 93997.00 | 137829.00 | 151450.00 | 127568.00 | 157504.00 | 103023.00 | 92646.00 | 143620.00 | 152683.00 | 154491.00 | 98869.00 | 124921.00 | 97312.00 | 153161.00 | 93997.00 | 137829.00 | 159080.00 | 306490.00 | 306490.00 | 306490.00 | 306490.00 | 307510.00 | 307511.00 | 307511.00 | 307511.00 | 307511.00 | 307511.00 | 307511.00 | 307511.00 | 307511.00 | 307511.0 | 307511.00 | 307511.0 | 307511.00 | 307511.00 | 307511.00 | 307511.00 | 307511.00 | 307511.00 | 307511.00 | 307511.00 | 307511.00 | 265992.00 | 265992.00 | 265992.00 | 265992.00 | 265992.00 | 265992.00 |
| mean | 278180.52 | 0.08 | 0.42 | 1.687979e+05 | 599026.00 | 27108.57 | 538396.21 | 0.02 | -16037.00 | 63815.05 | -4986.12 | -2994.20 | 12.06 | 1.0 | 0.82 | 0.2 | 1.00 | 0.28 | 0.06 | 2.15 | 2.05 | 2.03 | 12.06 | 0.02 | 0.05 | 0.04 | 0.08 | 0.23 | 0.18 | 0.50 | 0.51 | 0.51 | 0.12 | 0.09 | 0.98 | 0.75 | 0.04 | 0.08 | 0.15 | 0.23 | 0.23 | 0.07 | 0.10 | 0.11 | 0.01 | 0.03 | 0.11 | 0.09 | 0.98 | 0.76 | 0.04 | 0.07 | 0.15 | 0.22 | 0.23 | 0.06 | 0.11 | 0.11 | 0.01 | 0.03 | 0.12 | 0.09 | 0.98 | 0.76 | 0.04 | 0.08 | 0.15 | 0.23 | 0.23 | 0.07 | 0.10 | 0.11 | 0.01 | 0.03 | 0.10 | 1.42 | 0.14 | 1.41 | 0.10 | -962.86 | 0.00 | 0.71 | 0.00 | 0.02 | 0.09 | 0.00 | 0.08 | 0.00 | 0.0 | 0.00 | 0.0 | 0.00 | 0.00 | 0.00 | 0.01 | 0.00 | 0.01 | 0.00 | 0.00 | 0.00 | 0.01 | 0.01 | 0.03 | 0.27 | 0.27 | 1.90 |
| std | 102790.18 | 0.27 | 0.72 | 2.371231e+05 | 402490.78 | 14493.74 | 369446.46 | 0.01 | 4363.99 | 141275.77 | 3522.89 | 1509.45 | 11.94 | 0.0 | 0.38 | 0.4 | 0.04 | 0.45 | 0.23 | 0.91 | 0.51 | 0.50 | 3.27 | 0.12 | 0.22 | 0.20 | 0.27 | 0.42 | 0.38 | 0.21 | 0.19 | 0.19 | 0.11 | 0.08 | 0.06 | 0.11 | 0.08 | 0.13 | 0.10 | 0.14 | 0.16 | 0.08 | 0.09 | 0.11 | 0.05 | 0.07 | 0.11 | 0.08 | 0.06 | 0.11 | 0.07 | 0.13 | 0.10 | 0.14 | 0.16 | 0.08 | 0.10 | 0.11 | 0.05 | 0.07 | 0.11 | 0.08 | 0.06 | 0.11 | 0.08 | 0.13 | 0.10 | 0.15 | 0.16 | 0.08 | 0.09 | 0.11 | 0.05 | 0.07 | 0.11 | 2.40 | 0.45 | 2.38 | 0.36 | 826.81 | 0.01 | 0.45 | 0.01 | 0.12 | 0.28 | 0.01 | 0.27 | 0.06 | 0.0 | 0.06 | 0.0 | 0.06 | 0.05 | 0.03 | 0.10 | 0.02 | 0.09 | 0.02 | 0.02 | 0.02 | 0.08 | 0.11 | 0.20 | 0.92 | 0.79 | 1.87 |
| min | 100002.00 | 0.00 | 0.00 | 2.565000e+04 | 45000.00 | 1615.50 | 40500.00 | 0.00 | -25229.00 | -17912.00 | -24672.00 | -7197.00 | 0.00 | 0.0 | 0.00 | 0.0 | 0.00 | 0.00 | 0.00 | 1.00 | 1.00 | 1.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.01 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | -4292.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.0 | 0.00 | 0.0 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| 25% | 189145.50 | 0.00 | 0.00 | 1.125000e+05 | 270000.00 | 16524.00 | 238500.00 | 0.01 | -19682.00 | -2760.00 | -7479.50 | -4299.00 | 5.00 | 1.0 | 1.00 | 0.0 | 1.00 | 0.00 | 0.00 | 2.00 | 2.00 | 2.00 | 10.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.33 | 0.39 | 0.37 | 0.06 | 0.04 | 0.98 | 0.69 | 0.01 | 0.00 | 0.07 | 0.17 | 0.08 | 0.02 | 0.05 | 0.05 | 0.00 | 0.00 | 0.05 | 0.04 | 0.98 | 0.70 | 0.01 | 0.00 | 0.07 | 0.17 | 0.08 | 0.02 | 0.05 | 0.04 | 0.00 | 0.00 | 0.06 | 0.04 | 0.98 | 0.69 | 0.01 | 0.00 | 0.07 | 0.17 | 0.08 | 0.02 | 0.05 | 0.05 | 0.00 | 0.00 | 0.04 | 0.00 | 0.00 | 0.00 | 0.00 | -1570.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.0 | 0.00 | 0.0 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| 50% | 278202.00 | 0.00 | 0.00 | 1.471500e+05 | 513531.00 | 24903.00 | 450000.00 | 0.02 | -15750.00 | -1213.00 | -4504.00 | -3254.00 | 9.00 | 1.0 | 1.00 | 0.0 | 1.00 | 0.00 | 0.00 | 2.00 | 2.00 | 2.00 | 12.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.51 | 0.57 | 0.54 | 0.09 | 0.08 | 0.98 | 0.76 | 0.02 | 0.00 | 0.14 | 0.17 | 0.21 | 0.05 | 0.08 | 0.07 | 0.00 | 0.00 | 0.08 | 0.07 | 0.98 | 0.76 | 0.02 | 0.00 | 0.14 | 0.17 | 0.21 | 0.05 | 0.08 | 0.07 | 0.00 | 0.00 | 0.09 | 0.08 | 0.98 | 0.76 | 0.02 | 0.00 | 0.14 | 0.17 | 0.21 | 0.05 | 0.08 | 0.07 | 0.00 | 0.00 | 0.07 | 0.00 | 0.00 | 0.00 | 0.00 | -757.00 | 0.00 | 1.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.0 | 0.00 | 0.0 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 |
| 75% | 367142.50 | 0.00 | 1.00 | 2.025000e+05 | 808650.00 | 34596.00 | 679500.00 | 0.03 | -12413.00 | -289.00 | -2010.00 | -1720.00 | 15.00 | 1.0 | 1.00 | 0.0 | 1.00 | 1.00 | 0.00 | 3.00 | 2.00 | 2.00 | 14.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.68 | 0.66 | 0.67 | 0.15 | 0.11 | 0.99 | 0.82 | 0.05 | 0.12 | 0.21 | 0.33 | 0.38 | 0.09 | 0.12 | 0.13 | 0.00 | 0.03 | 0.14 | 0.11 | 0.99 | 0.82 | 0.05 | 0.12 | 0.21 | 0.33 | 0.38 | 0.08 | 0.13 | 0.13 | 0.00 | 0.02 | 0.15 | 0.11 | 0.99 | 0.83 | 0.05 | 0.12 | 0.21 | 0.33 | 0.38 | 0.09 | 0.12 | 0.13 | 0.00 | 0.03 | 0.13 | 2.00 | 0.00 | 2.00 | 0.00 | -274.00 | 0.00 | 1.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.0 | 0.00 | 0.0 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 3.00 |
| max | 456255.00 | 1.00 | 19.00 | 1.170000e+08 | 4050000.00 | 258025.50 | 4050000.00 | 0.07 | -7489.00 | 365243.00 | 0.00 | 0.00 | 91.00 | 1.0 | 1.00 | 1.0 | 1.00 | 1.00 | 1.00 | 20.00 | 3.00 | 3.00 | 23.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 0.96 | 0.85 | 0.90 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 348.00 | 34.00 | 344.00 | 24.00 | 0.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.0 | 1.00 | 1.0 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 4.00 | 9.00 | 8.00 | 27.00 | 261.00 | 25.00 |
None
display_feature_info(datasets['credit_card_balance'], 'credit_card_balance')
Description of the df continued for credit_card_balance:
---------------------------------------------------------------------------
Data type value counts:
float64 15
int64 7
object 1
dtype: int64
Return number of unique elements in the object.
NAME_CONTRACT_STATUS 7
dtype: int64
---------------------------------------------------------------------------
Categorical and Numerical(int + float) features of credit_card_balance.
---------------------------------------------------------------------------
{'int64': Index(['SK_ID_PREV', 'SK_ID_CURR', 'MONTHS_BALANCE', 'AMT_CREDIT_LIMIT_ACTUAL',
'CNT_DRAWINGS_CURRENT', 'SK_DPD', 'SK_DPD_DEF'],
dtype='object')}
------------------------------
{'float64': Index(['AMT_BALANCE', 'AMT_DRAWINGS_ATM_CURRENT', 'AMT_DRAWINGS_CURRENT',
'AMT_DRAWINGS_OTHER_CURRENT', 'AMT_DRAWINGS_POS_CURRENT',
'AMT_INST_MIN_REGULARITY', 'AMT_PAYMENT_CURRENT',
'AMT_PAYMENT_TOTAL_CURRENT', 'AMT_RECEIVABLE_PRINCIPAL',
'AMT_RECIVABLE', 'AMT_TOTAL_RECEIVABLE', 'CNT_DRAWINGS_ATM_CURRENT',
'CNT_DRAWINGS_OTHER_CURRENT', 'CNT_DRAWINGS_POS_CURRENT',
'CNT_INSTALMENT_MATURE_CUM'],
dtype='object')}
------------------------------
{'object': Index(['NAME_CONTRACT_STATUS'], dtype='object')}
------------------------------
---------------------------------------------------------------------------
---------------------------------------------------------------------------
The Missing Data:
| | Percent | Train Missing Count |
|---|---|---|
| AMT_PAYMENT_CURRENT | 20.00 | 767988 |
| AMT_DRAWINGS_ATM_CURRENT | 19.52 | 749816 |
| CNT_DRAWINGS_POS_CURRENT | 19.52 | 749816 |
| AMT_DRAWINGS_OTHER_CURRENT | 19.52 | 749816 |
| AMT_DRAWINGS_POS_CURRENT | 19.52 | 749816 |
| CNT_DRAWINGS_OTHER_CURRENT | 19.52 | 749816 |
| CNT_DRAWINGS_ATM_CURRENT | 19.52 | 749816 |
| CNT_INSTALMENT_MATURE_CUM | 7.95 | 305236 |
| AMT_INST_MIN_REGULARITY | 7.95 | 305236 |
---------------------------------------------------------------------------
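A missing-data table like the one above can be computed in a few lines of pandas. The sketch below is an assumption about how such a table might be produced, not the notebook's own helper: `missing_table` is a hypothetical function name, the column label `Missing Count` is chosen for illustration, and the toy frame only borrows three column names from credit_card_balance with made-up values.

```python
import pandas as pd

def missing_table(df: pd.DataFrame) -> pd.DataFrame:
    """Return percent and count of missing values per column, worst first."""
    count = df.isna().sum()
    percent = (count / len(df) * 100).round(2)
    table = pd.DataFrame({"Percent": percent, "Missing Count": count})
    # Keep only the columns that actually have missing values.
    table = table[table["Missing Count"] > 0]
    return table.sort_values("Percent", ascending=False)

# Toy stand-in for datasets['credit_card_balance'] (hypothetical values).
toy = pd.DataFrame({
    "AMT_BALANCE": [1.0, 2.0, 3.0, 4.0, 5.0],
    "AMT_PAYMENT_CURRENT": [None, 2.0, None, 4.0, 5.0],
    "AMT_INST_MIN_REGULARITY": [1.0, None, 3.0, 4.0, 5.0],
})

result = missing_table(toy)
print(result)
```

The percentages in the table above follow the same arithmetic: for example, 767988 missing values out of 3840312 rows is about 20.00%.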
display_stats(datasets['installments_payments'], 'installments_payments')
--------------------------------------------------------------------------------
installments_payments
--------------------------------------------------------------------------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13605401 entries, 0 to 13605400
Data columns (total 8 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 SK_ID_PREV 13605401 non-null int64
1 SK_ID_CURR 13605401 non-null int64
2 NUM_INSTALMENT_VERSION 13605401 non-null float64
3 NUM_INSTALMENT_NUMBER 13605401 non-null int64
4 DAYS_INSTALMENT 13605401 non-null float64
5 DAYS_ENTRY_PAYMENT 13602496 non-null float64
6 AMT_INSTALMENT 13605401 non-null float64
7 AMT_PAYMENT 13602496 non-null float64
dtypes: float64(5), int64(3)
memory usage: 830.4 MB
None
---------------------------------------------------------------------------
Shape of the df installments_payments is (13605401, 8)
---------------------------------------------------------------------------
Statistical summary of installments_payments is :
---------------------------------------------------------------------------
Description of the df installments_payments:
| | SK_ID_CURR | TARGET | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | AMT_GOODS_PRICE | REGION_POPULATION_RELATIVE | DAYS_BIRTH | DAYS_EMPLOYED | DAYS_REGISTRATION | DAYS_ID_PUBLISH | OWN_CAR_AGE | FLAG_MOBIL | FLAG_EMP_PHONE | FLAG_WORK_PHONE | FLAG_CONT_MOBILE | FLAG_PHONE | FLAG_EMAIL | CNT_FAM_MEMBERS | REGION_RATING_CLIENT | REGION_RATING_CLIENT_W_CITY | HOUR_APPR_PROCESS_START | REG_REGION_NOT_LIVE_REGION | REG_REGION_NOT_WORK_REGION | LIVE_REGION_NOT_WORK_REGION | REG_CITY_NOT_LIVE_CITY | REG_CITY_NOT_WORK_CITY | LIVE_CITY_NOT_WORK_CITY | EXT_SOURCE_1 | EXT_SOURCE_2 | EXT_SOURCE_3 | APARTMENTS_AVG | BASEMENTAREA_AVG | YEARS_BEGINEXPLUATATION_AVG | YEARS_BUILD_AVG | COMMONAREA_AVG | ELEVATORS_AVG | ENTRANCES_AVG | FLOORSMAX_AVG | FLOORSMIN_AVG | LANDAREA_AVG | LIVINGAPARTMENTS_AVG | LIVINGAREA_AVG | NONLIVINGAPARTMENTS_AVG | NONLIVINGAREA_AVG | APARTMENTS_MODE | BASEMENTAREA_MODE | YEARS_BEGINEXPLUATATION_MODE | YEARS_BUILD_MODE | COMMONAREA_MODE | ELEVATORS_MODE | ENTRANCES_MODE | FLOORSMAX_MODE | FLOORSMIN_MODE | LANDAREA_MODE | LIVINGAPARTMENTS_MODE | LIVINGAREA_MODE | NONLIVINGAPARTMENTS_MODE | NONLIVINGAREA_MODE | APARTMENTS_MEDI | BASEMENTAREA_MEDI | YEARS_BEGINEXPLUATATION_MEDI | YEARS_BUILD_MEDI | COMMONAREA_MEDI | ELEVATORS_MEDI | ENTRANCES_MEDI | FLOORSMAX_MEDI | FLOORSMIN_MEDI | LANDAREA_MEDI | LIVINGAPARTMENTS_MEDI | LIVINGAREA_MEDI | NONLIVINGAPARTMENTS_MEDI | NONLIVINGAREA_MEDI | TOTALAREA_MODE | OBS_30_CNT_SOCIAL_CIRCLE | DEF_30_CNT_SOCIAL_CIRCLE | OBS_60_CNT_SOCIAL_CIRCLE | DEF_60_CNT_SOCIAL_CIRCLE | DAYS_LAST_PHONE_CHANGE | FLAG_DOCUMENT_2 | FLAG_DOCUMENT_3 | FLAG_DOCUMENT_4 | FLAG_DOCUMENT_5 | FLAG_DOCUMENT_6 | FLAG_DOCUMENT_7 | FLAG_DOCUMENT_8 | FLAG_DOCUMENT_9 | FLAG_DOCUMENT_10 | FLAG_DOCUMENT_11 | FLAG_DOCUMENT_12 | FLAG_DOCUMENT_13 | FLAG_DOCUMENT_14 | FLAG_DOCUMENT_15 | FLAG_DOCUMENT_16 | FLAG_DOCUMENT_17 | FLAG_DOCUMENT_18 | FLAG_DOCUMENT_19 | FLAG_DOCUMENT_20 | FLAG_DOCUMENT_21 | AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 307511.00 | 307511.00 | 307511.00 | 3.075110e+05 | 307511.00 | 307499.00 | 307233.00 | 307511.00 | 307511.00 | 307511.00 | 307511.00 | 307511.00 | 104582.00 | 307511.0 | 307511.00 | 307511.0 | 307511.00 | 307511.00 | 307511.00 | 307509.00 | 307511.00 | 307511.00 | 307511.00 | 307511.00 | 307511.00 | 307511.00 | 307511.00 | 307511.00 | 307511.00 | 134133.00 | 306851.00 | 246546.00 | 151450.00 | 127568.00 | 157504.00 | 103023.00 | 92646.00 | 143620.00 | 152683.00 | 154491.00 | 98869.00 | 124921.00 | 97312.00 | 153161.00 | 93997.00 | 137829.00 | 151450.00 | 127568.00 | 157504.00 | 103023.00 | 92646.00 | 143620.00 | 152683.00 | 154491.00 | 98869.00 | 124921.00 | 97312.00 | 153161.00 | 93997.00 | 137829.00 | 151450.00 | 127568.00 | 157504.00 | 103023.00 | 92646.00 | 143620.00 | 152683.00 | 154491.00 | 98869.00 | 124921.00 | 97312.00 | 153161.00 | 93997.00 | 137829.00 | 159080.00 | 306490.00 | 306490.00 | 306490.00 | 306490.00 | 307510.00 | 307511.00 | 307511.00 | 307511.00 | 307511.00 | 307511.00 | 307511.00 | 307511.00 | 307511.00 | 307511.0 | 307511.00 | 307511.0 | 307511.00 | 307511.00 | 307511.00 | 307511.00 | 307511.00 | 307511.00 | 307511.00 | 307511.00 | 307511.00 | 265992.00 | 265992.00 | 265992.00 | 265992.00 | 265992.00 | 265992.00 |
| mean | 278180.52 | 0.08 | 0.42 | 1.687979e+05 | 599026.00 | 27108.57 | 538396.21 | 0.02 | -16037.00 | 63815.05 | -4986.12 | -2994.20 | 12.06 | 1.0 | 0.82 | 0.2 | 1.00 | 0.28 | 0.06 | 2.15 | 2.05 | 2.03 | 12.06 | 0.02 | 0.05 | 0.04 | 0.08 | 0.23 | 0.18 | 0.50 | 0.51 | 0.51 | 0.12 | 0.09 | 0.98 | 0.75 | 0.04 | 0.08 | 0.15 | 0.23 | 0.23 | 0.07 | 0.10 | 0.11 | 0.01 | 0.03 | 0.11 | 0.09 | 0.98 | 0.76 | 0.04 | 0.07 | 0.15 | 0.22 | 0.23 | 0.06 | 0.11 | 0.11 | 0.01 | 0.03 | 0.12 | 0.09 | 0.98 | 0.76 | 0.04 | 0.08 | 0.15 | 0.23 | 0.23 | 0.07 | 0.10 | 0.11 | 0.01 | 0.03 | 0.10 | 1.42 | 0.14 | 1.41 | 0.10 | -962.86 | 0.00 | 0.71 | 0.00 | 0.02 | 0.09 | 0.00 | 0.08 | 0.00 | 0.0 | 0.00 | 0.0 | 0.00 | 0.00 | 0.00 | 0.01 | 0.00 | 0.01 | 0.00 | 0.00 | 0.00 | 0.01 | 0.01 | 0.03 | 0.27 | 0.27 | 1.90 |
| std | 102790.18 | 0.27 | 0.72 | 2.371231e+05 | 402490.78 | 14493.74 | 369446.46 | 0.01 | 4363.99 | 141275.77 | 3522.89 | 1509.45 | 11.94 | 0.0 | 0.38 | 0.4 | 0.04 | 0.45 | 0.23 | 0.91 | 0.51 | 0.50 | 3.27 | 0.12 | 0.22 | 0.20 | 0.27 | 0.42 | 0.38 | 0.21 | 0.19 | 0.19 | 0.11 | 0.08 | 0.06 | 0.11 | 0.08 | 0.13 | 0.10 | 0.14 | 0.16 | 0.08 | 0.09 | 0.11 | 0.05 | 0.07 | 0.11 | 0.08 | 0.06 | 0.11 | 0.07 | 0.13 | 0.10 | 0.14 | 0.16 | 0.08 | 0.10 | 0.11 | 0.05 | 0.07 | 0.11 | 0.08 | 0.06 | 0.11 | 0.08 | 0.13 | 0.10 | 0.15 | 0.16 | 0.08 | 0.09 | 0.11 | 0.05 | 0.07 | 0.11 | 2.40 | 0.45 | 2.38 | 0.36 | 826.81 | 0.01 | 0.45 | 0.01 | 0.12 | 0.28 | 0.01 | 0.27 | 0.06 | 0.0 | 0.06 | 0.0 | 0.06 | 0.05 | 0.03 | 0.10 | 0.02 | 0.09 | 0.02 | 0.02 | 0.02 | 0.08 | 0.11 | 0.20 | 0.92 | 0.79 | 1.87 |
| min | 100002.00 | 0.00 | 0.00 | 2.565000e+04 | 45000.00 | 1615.50 | 40500.00 | 0.00 | -25229.00 | -17912.00 | -24672.00 | -7197.00 | 0.00 | 0.0 | 0.00 | 0.0 | 0.00 | 0.00 | 0.00 | 1.00 | 1.00 | 1.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.01 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | -4292.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.0 | 0.00 | 0.0 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| 25% | 189145.50 | 0.00 | 0.00 | 1.125000e+05 | 270000.00 | 16524.00 | 238500.00 | 0.01 | -19682.00 | -2760.00 | -7479.50 | -4299.00 | 5.00 | 1.0 | 1.00 | 0.0 | 1.00 | 0.00 | 0.00 | 2.00 | 2.00 | 2.00 | 10.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.33 | 0.39 | 0.37 | 0.06 | 0.04 | 0.98 | 0.69 | 0.01 | 0.00 | 0.07 | 0.17 | 0.08 | 0.02 | 0.05 | 0.05 | 0.00 | 0.00 | 0.05 | 0.04 | 0.98 | 0.70 | 0.01 | 0.00 | 0.07 | 0.17 | 0.08 | 0.02 | 0.05 | 0.04 | 0.00 | 0.00 | 0.06 | 0.04 | 0.98 | 0.69 | 0.01 | 0.00 | 0.07 | 0.17 | 0.08 | 0.02 | 0.05 | 0.05 | 0.00 | 0.00 | 0.04 | 0.00 | 0.00 | 0.00 | 0.00 | -1570.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.0 | 0.00 | 0.0 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| 50% | 278202.00 | 0.00 | 0.00 | 1.471500e+05 | 513531.00 | 24903.00 | 450000.00 | 0.02 | -15750.00 | -1213.00 | -4504.00 | -3254.00 | 9.00 | 1.0 | 1.00 | 0.0 | 1.00 | 0.00 | 0.00 | 2.00 | 2.00 | 2.00 | 12.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.51 | 0.57 | 0.54 | 0.09 | 0.08 | 0.98 | 0.76 | 0.02 | 0.00 | 0.14 | 0.17 | 0.21 | 0.05 | 0.08 | 0.07 | 0.00 | 0.00 | 0.08 | 0.07 | 0.98 | 0.76 | 0.02 | 0.00 | 0.14 | 0.17 | 0.21 | 0.05 | 0.08 | 0.07 | 0.00 | 0.00 | 0.09 | 0.08 | 0.98 | 0.76 | 0.02 | 0.00 | 0.14 | 0.17 | 0.21 | 0.05 | 0.08 | 0.07 | 0.00 | 0.00 | 0.07 | 0.00 | 0.00 | 0.00 | 0.00 | -757.00 | 0.00 | 1.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.0 | 0.00 | 0.0 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 |
| 75% | 367142.50 | 0.00 | 1.00 | 2.025000e+05 | 808650.00 | 34596.00 | 679500.00 | 0.03 | -12413.00 | -289.00 | -2010.00 | -1720.00 | 15.00 | 1.0 | 1.00 | 0.0 | 1.00 | 1.00 | 0.00 | 3.00 | 2.00 | 2.00 | 14.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.68 | 0.66 | 0.67 | 0.15 | 0.11 | 0.99 | 0.82 | 0.05 | 0.12 | 0.21 | 0.33 | 0.38 | 0.09 | 0.12 | 0.13 | 0.00 | 0.03 | 0.14 | 0.11 | 0.99 | 0.82 | 0.05 | 0.12 | 0.21 | 0.33 | 0.38 | 0.08 | 0.13 | 0.13 | 0.00 | 0.02 | 0.15 | 0.11 | 0.99 | 0.83 | 0.05 | 0.12 | 0.21 | 0.33 | 0.38 | 0.09 | 0.12 | 0.13 | 0.00 | 0.03 | 0.13 | 2.00 | 0.00 | 2.00 | 0.00 | -274.00 | 0.00 | 1.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.0 | 0.00 | 0.0 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 3.00 |
| max | 456255.00 | 1.00 | 19.00 | 1.170000e+08 | 4050000.00 | 258025.50 | 4050000.00 | 0.07 | -7489.00 | 365243.00 | 0.00 | 0.00 | 91.00 | 1.0 | 1.00 | 1.0 | 1.00 | 1.00 | 1.00 | 20.00 | 3.00 | 3.00 | 23.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 0.96 | 0.85 | 0.90 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 348.00 | 34.00 | 344.00 | 24.00 | 0.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.0 | 1.00 | 1.0 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 4.00 | 9.00 | 8.00 | 27.00 | 261.00 | 25.00 |
None
display_feature_info(datasets['installments_payments'], 'installments_payments')
Description of the df continued for installments_payments:
---------------------------------------------------------------------------
Data type value counts:
float64 5
int64 3
dtype: int64
Return number of unique elements in the object.
Series([], dtype: float64)
---------------------------------------------------------------------------
Categorical and Numerical(int + float) features of installments_payments.
---------------------------------------------------------------------------
{'int64': Index(['SK_ID_PREV', 'SK_ID_CURR', 'NUM_INSTALMENT_NUMBER'], dtype='object')}
------------------------------
{'float64': Index(['NUM_INSTALMENT_VERSION', 'DAYS_INSTALMENT', 'DAYS_ENTRY_PAYMENT',
'AMT_INSTALMENT', 'AMT_PAYMENT'],
dtype='object')}
------------------------------
---------------------------------------------------------------------------
---------------------------------------------------------------------------
The Missing Data:
| | Percent | Train Missing Count |
|---|---|---|
| DAYS_ENTRY_PAYMENT | 0.02 | 2905 |
| AMT_PAYMENT | 0.02 | 2905 |
---------------------------------------------------------------------------
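The dtype counts, unique-value counts, and per-dtype column groups printed above for each table could be reproduced with something like the sketch below. This is an assumption about the approach, not the notebook's actual `display_feature_info`: `feature_summary` is a hypothetical helper, and the toy frame reuses a few column names from the tables with made-up values.

```python
import pandas as pd

def feature_summary(df: pd.DataFrame):
    """Summarise a frame by dtype: counts, object cardinality, column groups."""
    # How many columns there are of each dtype.
    dtype_counts = df.dtypes.value_counts()
    # Number of distinct values in each object (categorical) column;
    # this is an empty Series when the frame has no object columns.
    object_nunique = df.select_dtypes(include="object").nunique()
    # Column names grouped by dtype, keyed by the dtype's name.
    columns_by_dtype = {
        str(dtype): df.columns[(df.dtypes == dtype).to_numpy()]
        for dtype in df.dtypes.unique()
    }
    return dtype_counts, object_nunique, columns_by_dtype

# Toy frame with one int, one float, and one object column (hypothetical values).
toy = pd.DataFrame({
    "SK_ID_PREV": [1, 2, 3],
    "AMT_PAYMENT": [1.5, None, 3.0],
    "NAME_CONTRACT_STATUS": pd.Series(["Active", "Completed", "Active"], dtype="object"),
})

counts, nunique, groups = feature_summary(toy)
print(counts)
print(nunique)
print(groups)
```

For a table such as installments_payments, which has only numeric columns, the second return value would be empty, which matches the empty `Series` shown in its output.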
display_stats(datasets['POS_CASH_balance'], 'POS_CASH_balance')
--------------------------------------------------------------------------------
POS_CASH_balance
--------------------------------------------------------------------------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10001358 entries, 0 to 10001357
Data columns (total 8 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 SK_ID_PREV 10001358 non-null int64
1 SK_ID_CURR 10001358 non-null int64
2 MONTHS_BALANCE 10001358 non-null int64
3 CNT_INSTALMENT 9975287 non-null float64
4 CNT_INSTALMENT_FUTURE 9975271 non-null float64
5 NAME_CONTRACT_STATUS 10001358 non-null object
6 SK_DPD 10001358 non-null int64
7 SK_DPD_DEF 10001358 non-null int64
dtypes: float64(2), int64(5), object(1)
memory usage: 610.4+ MB
None
---------------------------------------------------------------------------
Shape of the df POS_CASH_balance is (10001358, 8)
---------------------------------------------------------------------------
Statistical summary of POS_CASH_balance is :
---------------------------------------------------------------------------
Description of the df POS_CASH_balance:
| | SK_ID_CURR | TARGET | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | AMT_GOODS_PRICE | REGION_POPULATION_RELATIVE | DAYS_BIRTH | DAYS_EMPLOYED | DAYS_REGISTRATION | DAYS_ID_PUBLISH | OWN_CAR_AGE | FLAG_MOBIL | FLAG_EMP_PHONE | FLAG_WORK_PHONE | FLAG_CONT_MOBILE | FLAG_PHONE | FLAG_EMAIL | CNT_FAM_MEMBERS | REGION_RATING_CLIENT | REGION_RATING_CLIENT_W_CITY | HOUR_APPR_PROCESS_START | REG_REGION_NOT_LIVE_REGION | REG_REGION_NOT_WORK_REGION | LIVE_REGION_NOT_WORK_REGION | REG_CITY_NOT_LIVE_CITY | REG_CITY_NOT_WORK_CITY | LIVE_CITY_NOT_WORK_CITY | EXT_SOURCE_1 | EXT_SOURCE_2 | EXT_SOURCE_3 | APARTMENTS_AVG | BASEMENTAREA_AVG | YEARS_BEGINEXPLUATATION_AVG | YEARS_BUILD_AVG | COMMONAREA_AVG | ELEVATORS_AVG | ENTRANCES_AVG | FLOORSMAX_AVG | FLOORSMIN_AVG | LANDAREA_AVG | LIVINGAPARTMENTS_AVG | LIVINGAREA_AVG | NONLIVINGAPARTMENTS_AVG | NONLIVINGAREA_AVG | APARTMENTS_MODE | BASEMENTAREA_MODE | YEARS_BEGINEXPLUATATION_MODE | YEARS_BUILD_MODE | COMMONAREA_MODE | ELEVATORS_MODE | ENTRANCES_MODE | FLOORSMAX_MODE | FLOORSMIN_MODE | LANDAREA_MODE | LIVINGAPARTMENTS_MODE | LIVINGAREA_MODE | NONLIVINGAPARTMENTS_MODE | NONLIVINGAREA_MODE | APARTMENTS_MEDI | BASEMENTAREA_MEDI | YEARS_BEGINEXPLUATATION_MEDI | YEARS_BUILD_MEDI | COMMONAREA_MEDI | ELEVATORS_MEDI | ENTRANCES_MEDI | FLOORSMAX_MEDI | FLOORSMIN_MEDI | LANDAREA_MEDI | LIVINGAPARTMENTS_MEDI | LIVINGAREA_MEDI | NONLIVINGAPARTMENTS_MEDI | NONLIVINGAREA_MEDI | TOTALAREA_MODE | OBS_30_CNT_SOCIAL_CIRCLE | DEF_30_CNT_SOCIAL_CIRCLE | OBS_60_CNT_SOCIAL_CIRCLE | DEF_60_CNT_SOCIAL_CIRCLE | DAYS_LAST_PHONE_CHANGE | FLAG_DOCUMENT_2 | FLAG_DOCUMENT_3 | FLAG_DOCUMENT_4 | FLAG_DOCUMENT_5 | FLAG_DOCUMENT_6 | FLAG_DOCUMENT_7 | FLAG_DOCUMENT_8 | FLAG_DOCUMENT_9 | FLAG_DOCUMENT_10 | FLAG_DOCUMENT_11 | FLAG_DOCUMENT_12 | FLAG_DOCUMENT_13 | FLAG_DOCUMENT_14 | FLAG_DOCUMENT_15 | FLAG_DOCUMENT_16 | FLAG_DOCUMENT_17 | FLAG_DOCUMENT_18 | FLAG_DOCUMENT_19 | FLAG_DOCUMENT_20 | FLAG_DOCUMENT_21 | AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 307511.00 | 307511.00 | 307511.00 | 3.075110e+05 | 307511.00 | 307499.00 | 307233.00 | 307511.00 | 307511.00 | 307511.00 | 307511.00 | 307511.00 | 104582.00 | 307511.0 | 307511.00 | 307511.0 | 307511.00 | 307511.00 | 307511.00 | 307509.00 | 307511.00 | 307511.00 | 307511.00 | 307511.00 | 307511.00 | 307511.00 | 307511.00 | 307511.00 | 307511.00 | 134133.00 | 306851.00 | 246546.00 | 151450.00 | 127568.00 | 157504.00 | 103023.00 | 92646.00 | 143620.00 | 152683.00 | 154491.00 | 98869.00 | 124921.00 | 97312.00 | 153161.00 | 93997.00 | 137829.00 | 151450.00 | 127568.00 | 157504.00 | 103023.00 | 92646.00 | 143620.00 | 152683.00 | 154491.00 | 98869.00 | 124921.00 | 97312.00 | 153161.00 | 93997.00 | 137829.00 | 151450.00 | 127568.00 | 157504.00 | 103023.00 | 92646.00 | 143620.00 | 152683.00 | 154491.00 | 98869.00 | 124921.00 | 97312.00 | 153161.00 | 93997.00 | 137829.00 | 159080.00 | 306490.00 | 306490.00 | 306490.00 | 306490.00 | 307510.00 | 307511.00 | 307511.00 | 307511.00 | 307511.00 | 307511.00 | 307511.00 | 307511.00 | 307511.00 | 307511.0 | 307511.00 | 307511.0 | 307511.00 | 307511.00 | 307511.00 | 307511.00 | 307511.00 | 307511.00 | 307511.00 | 307511.00 | 307511.00 | 265992.00 | 265992.00 | 265992.00 | 265992.00 | 265992.00 | 265992.00 |
| mean | 278180.52 | 0.08 | 0.42 | 1.687979e+05 | 599026.00 | 27108.57 | 538396.21 | 0.02 | -16037.00 | 63815.05 | -4986.12 | -2994.20 | 12.06 | 1.0 | 0.82 | 0.2 | 1.00 | 0.28 | 0.06 | 2.15 | 2.05 | 2.03 | 12.06 | 0.02 | 0.05 | 0.04 | 0.08 | 0.23 | 0.18 | 0.50 | 0.51 | 0.51 | 0.12 | 0.09 | 0.98 | 0.75 | 0.04 | 0.08 | 0.15 | 0.23 | 0.23 | 0.07 | 0.10 | 0.11 | 0.01 | 0.03 | 0.11 | 0.09 | 0.98 | 0.76 | 0.04 | 0.07 | 0.15 | 0.22 | 0.23 | 0.06 | 0.11 | 0.11 | 0.01 | 0.03 | 0.12 | 0.09 | 0.98 | 0.76 | 0.04 | 0.08 | 0.15 | 0.23 | 0.23 | 0.07 | 0.10 | 0.11 | 0.01 | 0.03 | 0.10 | 1.42 | 0.14 | 1.41 | 0.10 | -962.86 | 0.00 | 0.71 | 0.00 | 0.02 | 0.09 | 0.00 | 0.08 | 0.00 | 0.0 | 0.00 | 0.0 | 0.00 | 0.00 | 0.00 | 0.01 | 0.00 | 0.01 | 0.00 | 0.00 | 0.00 | 0.01 | 0.01 | 0.03 | 0.27 | 0.27 | 1.90 |
| std | 102790.18 | 0.27 | 0.72 | 2.371231e+05 | 402490.78 | 14493.74 | 369446.46 | 0.01 | 4363.99 | 141275.77 | 3522.89 | 1509.45 | 11.94 | 0.0 | 0.38 | 0.4 | 0.04 | 0.45 | 0.23 | 0.91 | 0.51 | 0.50 | 3.27 | 0.12 | 0.22 | 0.20 | 0.27 | 0.42 | 0.38 | 0.21 | 0.19 | 0.19 | 0.11 | 0.08 | 0.06 | 0.11 | 0.08 | 0.13 | 0.10 | 0.14 | 0.16 | 0.08 | 0.09 | 0.11 | 0.05 | 0.07 | 0.11 | 0.08 | 0.06 | 0.11 | 0.07 | 0.13 | 0.10 | 0.14 | 0.16 | 0.08 | 0.10 | 0.11 | 0.05 | 0.07 | 0.11 | 0.08 | 0.06 | 0.11 | 0.08 | 0.13 | 0.10 | 0.15 | 0.16 | 0.08 | 0.09 | 0.11 | 0.05 | 0.07 | 0.11 | 2.40 | 0.45 | 2.38 | 0.36 | 826.81 | 0.01 | 0.45 | 0.01 | 0.12 | 0.28 | 0.01 | 0.27 | 0.06 | 0.0 | 0.06 | 0.0 | 0.06 | 0.05 | 0.03 | 0.10 | 0.02 | 0.09 | 0.02 | 0.02 | 0.02 | 0.08 | 0.11 | 0.20 | 0.92 | 0.79 | 1.87 |
| min | 100002.00 | 0.00 | 0.00 | 2.565000e+04 | 45000.00 | 1615.50 | 40500.00 | 0.00 | -25229.00 | -17912.00 | -24672.00 | -7197.00 | 0.00 | 0.0 | 0.00 | 0.0 | 0.00 | 0.00 | 0.00 | 1.00 | 1.00 | 1.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.01 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | -4292.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.0 | 0.00 | 0.0 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| 25% | 189145.50 | 0.00 | 0.00 | 1.125000e+05 | 270000.00 | 16524.00 | 238500.00 | 0.01 | -19682.00 | -2760.00 | -7479.50 | -4299.00 | 5.00 | 1.0 | 1.00 | 0.0 | 1.00 | 0.00 | 0.00 | 2.00 | 2.00 | 2.00 | 10.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.33 | 0.39 | 0.37 | 0.06 | 0.04 | 0.98 | 0.69 | 0.01 | 0.00 | 0.07 | 0.17 | 0.08 | 0.02 | 0.05 | 0.05 | 0.00 | 0.00 | 0.05 | 0.04 | 0.98 | 0.70 | 0.01 | 0.00 | 0.07 | 0.17 | 0.08 | 0.02 | 0.05 | 0.04 | 0.00 | 0.00 | 0.06 | 0.04 | 0.98 | 0.69 | 0.01 | 0.00 | 0.07 | 0.17 | 0.08 | 0.02 | 0.05 | 0.05 | 0.00 | 0.00 | 0.04 | 0.00 | 0.00 | 0.00 | 0.00 | -1570.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.0 | 0.00 | 0.0 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| 50% | 278202.00 | 0.00 | 0.00 | 1.471500e+05 | 513531.00 | 24903.00 | 450000.00 | 0.02 | -15750.00 | -1213.00 | -4504.00 | -3254.00 | 9.00 | 1.0 | 1.00 | 0.0 | 1.00 | 0.00 | 0.00 | 2.00 | 2.00 | 2.00 | 12.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.51 | 0.57 | 0.54 | 0.09 | 0.08 | 0.98 | 0.76 | 0.02 | 0.00 | 0.14 | 0.17 | 0.21 | 0.05 | 0.08 | 0.07 | 0.00 | 0.00 | 0.08 | 0.07 | 0.98 | 0.76 | 0.02 | 0.00 | 0.14 | 0.17 | 0.21 | 0.05 | 0.08 | 0.07 | 0.00 | 0.00 | 0.09 | 0.08 | 0.98 | 0.76 | 0.02 | 0.00 | 0.14 | 0.17 | 0.21 | 0.05 | 0.08 | 0.07 | 0.00 | 0.00 | 0.07 | 0.00 | 0.00 | 0.00 | 0.00 | -757.00 | 0.00 | 1.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.0 | 0.00 | 0.0 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 |
| 75% | 367142.50 | 0.00 | 1.00 | 2.025000e+05 | 808650.00 | 34596.00 | 679500.00 | 0.03 | -12413.00 | -289.00 | -2010.00 | -1720.00 | 15.00 | 1.0 | 1.00 | 0.0 | 1.00 | 1.00 | 0.00 | 3.00 | 2.00 | 2.00 | 14.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.68 | 0.66 | 0.67 | 0.15 | 0.11 | 0.99 | 0.82 | 0.05 | 0.12 | 0.21 | 0.33 | 0.38 | 0.09 | 0.12 | 0.13 | 0.00 | 0.03 | 0.14 | 0.11 | 0.99 | 0.82 | 0.05 | 0.12 | 0.21 | 0.33 | 0.38 | 0.08 | 0.13 | 0.13 | 0.00 | 0.02 | 0.15 | 0.11 | 0.99 | 0.83 | 0.05 | 0.12 | 0.21 | 0.33 | 0.38 | 0.09 | 0.12 | 0.13 | 0.00 | 0.03 | 0.13 | 2.00 | 0.00 | 2.00 | 0.00 | -274.00 | 0.00 | 1.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.0 | 0.00 | 0.0 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 3.00 |
| max | 456255.00 | 1.00 | 19.00 | 1.170000e+08 | 4050000.00 | 258025.50 | 4050000.00 | 0.07 | -7489.00 | 365243.00 | 0.00 | 0.00 | 91.00 | 1.0 | 1.00 | 1.0 | 1.00 | 1.00 | 1.00 | 20.00 | 3.00 | 3.00 | 23.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 0.96 | 0.85 | 0.90 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 348.00 | 34.00 | 344.00 | 24.00 | 0.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.0 | 1.00 | 1.0 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 4.00 | 9.00 | 8.00 | 27.00 | 261.00 | 25.00 |
None
display_feature_info(datasets['POS_CASH_balance'], 'POS_CASH_balance')
Description of the df continued for POS_CASH_balance:
---------------------------------------------------------------------------
Data type value counts:
int64 5
float64 2
object 1
dtype: int64
Return number of unique elements in the object.
NAME_CONTRACT_STATUS 9
dtype: int64
---------------------------------------------------------------------------
Categorical and Numerical(int + float) features of POS_CASH_balance.
---------------------------------------------------------------------------
{'int64': Index(['SK_ID_PREV', 'SK_ID_CURR', 'MONTHS_BALANCE', 'SK_DPD', 'SK_DPD_DEF'], dtype='object')}
------------------------------
{'float64': Index(['CNT_INSTALMENT', 'CNT_INSTALMENT_FUTURE'], dtype='object')}
------------------------------
{'object': Index(['NAME_CONTRACT_STATUS'], dtype='object')}
------------------------------
---------------------------------------------------------------------------
---------------------------------------------------------------------------
The Missing Data:
| | Percent | Train Missing Count |
|---|---|---|
| CNT_INSTALMENT_FUTURE | 0.26 | 26087 |
| CNT_INSTALMENT | 0.26 | 26071 |
---------------------------------------------------------------------------
display_stats(datasets['application_test'], 'application_test')
--------------------------------------------------------------------------------
application_test
--------------------------------------------------------------------------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48744 entries, 0 to 48743
Data columns (total 121 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 SK_ID_CURR 48744 non-null int64
1 NAME_CONTRACT_TYPE 48744 non-null object
2 CODE_GENDER 48744 non-null object
3 FLAG_OWN_CAR 48744 non-null object
4 FLAG_OWN_REALTY 48744 non-null object
5 CNT_CHILDREN 48744 non-null int64
6 AMT_INCOME_TOTAL 48744 non-null float64
7 AMT_CREDIT 48744 non-null float64
8 AMT_ANNUITY 48720 non-null float64
9 AMT_GOODS_PRICE 48744 non-null float64
10 NAME_TYPE_SUITE 47833 non-null object
11 NAME_INCOME_TYPE 48744 non-null object
12 NAME_EDUCATION_TYPE 48744 non-null object
13 NAME_FAMILY_STATUS 48744 non-null object
14 NAME_HOUSING_TYPE 48744 non-null object
15 REGION_POPULATION_RELATIVE 48744 non-null float64
16 DAYS_BIRTH 48744 non-null int64
17 DAYS_EMPLOYED 48744 non-null int64
18 DAYS_REGISTRATION 48744 non-null float64
19 DAYS_ID_PUBLISH 48744 non-null int64
20 OWN_CAR_AGE 16432 non-null float64
21 FLAG_MOBIL 48744 non-null int64
22 FLAG_EMP_PHONE 48744 non-null int64
23 FLAG_WORK_PHONE 48744 non-null int64
24 FLAG_CONT_MOBILE 48744 non-null int64
25 FLAG_PHONE 48744 non-null int64
26 FLAG_EMAIL 48744 non-null int64
27 OCCUPATION_TYPE 33139 non-null object
28 CNT_FAM_MEMBERS 48744 non-null float64
29 REGION_RATING_CLIENT 48744 non-null int64
30 REGION_RATING_CLIENT_W_CITY 48744 non-null int64
31 WEEKDAY_APPR_PROCESS_START 48744 non-null object
32 HOUR_APPR_PROCESS_START 48744 non-null int64
33 REG_REGION_NOT_LIVE_REGION 48744 non-null int64
34 REG_REGION_NOT_WORK_REGION 48744 non-null int64
35 LIVE_REGION_NOT_WORK_REGION 48744 non-null int64
36 REG_CITY_NOT_LIVE_CITY 48744 non-null int64
37 REG_CITY_NOT_WORK_CITY 48744 non-null int64
38 LIVE_CITY_NOT_WORK_CITY 48744 non-null int64
39 ORGANIZATION_TYPE 48744 non-null object
40 EXT_SOURCE_1 28212 non-null float64
41 EXT_SOURCE_2 48736 non-null float64
42 EXT_SOURCE_3 40076 non-null float64
43 APARTMENTS_AVG 24857 non-null float64
44 BASEMENTAREA_AVG 21103 non-null float64
45 YEARS_BEGINEXPLUATATION_AVG 25888 non-null float64
46 YEARS_BUILD_AVG 16926 non-null float64
47 COMMONAREA_AVG 15249 non-null float64
48 ELEVATORS_AVG 23555 non-null float64
49 ENTRANCES_AVG 25165 non-null float64
50 FLOORSMAX_AVG 25423 non-null float64
51 FLOORSMIN_AVG 16278 non-null float64
52 LANDAREA_AVG 20490 non-null float64
53 LIVINGAPARTMENTS_AVG 15964 non-null float64
54 LIVINGAREA_AVG 25192 non-null float64
55 NONLIVINGAPARTMENTS_AVG 15397 non-null float64
56 NONLIVINGAREA_AVG 22660 non-null float64
57 APARTMENTS_MODE 24857 non-null float64
58 BASEMENTAREA_MODE 21103 non-null float64
59 YEARS_BEGINEXPLUATATION_MODE 25888 non-null float64
60 YEARS_BUILD_MODE 16926 non-null float64
61 COMMONAREA_MODE 15249 non-null float64
62 ELEVATORS_MODE 23555 non-null float64
63 ENTRANCES_MODE 25165 non-null float64
64 FLOORSMAX_MODE 25423 non-null float64
65 FLOORSMIN_MODE 16278 non-null float64
66 LANDAREA_MODE 20490 non-null float64
67 LIVINGAPARTMENTS_MODE 15964 non-null float64
68 LIVINGAREA_MODE 25192 non-null float64
69 NONLIVINGAPARTMENTS_MODE 15397 non-null float64
70 NONLIVINGAREA_MODE 22660 non-null float64
71 APARTMENTS_MEDI 24857 non-null float64
72 BASEMENTAREA_MEDI 21103 non-null float64
73 YEARS_BEGINEXPLUATATION_MEDI 25888 non-null float64
74 YEARS_BUILD_MEDI 16926 non-null float64
75 COMMONAREA_MEDI 15249 non-null float64
76 ELEVATORS_MEDI 23555 non-null float64
77 ENTRANCES_MEDI 25165 non-null float64
78 FLOORSMAX_MEDI 25423 non-null float64
79 FLOORSMIN_MEDI 16278 non-null float64
80 LANDAREA_MEDI 20490 non-null float64
81 LIVINGAPARTMENTS_MEDI 15964 non-null float64
82 LIVINGAREA_MEDI 25192 non-null float64
83 NONLIVINGAPARTMENTS_MEDI 15397 non-null float64
84 NONLIVINGAREA_MEDI 22660 non-null float64
85 FONDKAPREMONT_MODE 15947 non-null object
86 HOUSETYPE_MODE 25125 non-null object
87 TOTALAREA_MODE 26120 non-null float64
88 WALLSMATERIAL_MODE 24851 non-null object
89 EMERGENCYSTATE_MODE 26535 non-null object
90 OBS_30_CNT_SOCIAL_CIRCLE 48715 non-null float64
91 DEF_30_CNT_SOCIAL_CIRCLE 48715 non-null float64
92 OBS_60_CNT_SOCIAL_CIRCLE 48715 non-null float64
93 DEF_60_CNT_SOCIAL_CIRCLE 48715 non-null float64
94 DAYS_LAST_PHONE_CHANGE 48744 non-null float64
95 FLAG_DOCUMENT_2 48744 non-null int64
96 FLAG_DOCUMENT_3 48744 non-null int64
97 FLAG_DOCUMENT_4 48744 non-null int64
98 FLAG_DOCUMENT_5 48744 non-null int64
99 FLAG_DOCUMENT_6 48744 non-null int64
100 FLAG_DOCUMENT_7 48744 non-null int64
101 FLAG_DOCUMENT_8 48744 non-null int64
102 FLAG_DOCUMENT_9 48744 non-null int64
103 FLAG_DOCUMENT_10 48744 non-null int64
104 FLAG_DOCUMENT_11 48744 non-null int64
105 FLAG_DOCUMENT_12 48744 non-null int64
106 FLAG_DOCUMENT_13 48744 non-null int64
107 FLAG_DOCUMENT_14 48744 non-null int64
108 FLAG_DOCUMENT_15 48744 non-null int64
109 FLAG_DOCUMENT_16 48744 non-null int64
110 FLAG_DOCUMENT_17 48744 non-null int64
111 FLAG_DOCUMENT_18 48744 non-null int64
112 FLAG_DOCUMENT_19 48744 non-null int64
113 FLAG_DOCUMENT_20 48744 non-null int64
114 FLAG_DOCUMENT_21 48744 non-null int64
115 AMT_REQ_CREDIT_BUREAU_HOUR 42695 non-null float64
116 AMT_REQ_CREDIT_BUREAU_DAY 42695 non-null float64
117 AMT_REQ_CREDIT_BUREAU_WEEK 42695 non-null float64
118 AMT_REQ_CREDIT_BUREAU_MON 42695 non-null float64
119 AMT_REQ_CREDIT_BUREAU_QRT 42695 non-null float64
120 AMT_REQ_CREDIT_BUREAU_YEAR 42695 non-null float64
dtypes: float64(65), int64(40), object(16)
memory usage: 45.0+ MB
None
---------------------------------------------------------------------------
Shape of the df application_test is (48744, 121)
---------------------------------------------------------------------------
Statistical summary of application_test is :
---------------------------------------------------------------------------
Description of the df application_test:
| | SK_ID_CURR | TARGET | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | AMT_GOODS_PRICE | REGION_POPULATION_RELATIVE | DAYS_BIRTH | DAYS_EMPLOYED | DAYS_REGISTRATION | DAYS_ID_PUBLISH | OWN_CAR_AGE | FLAG_MOBIL | FLAG_EMP_PHONE | FLAG_WORK_PHONE | FLAG_CONT_MOBILE | FLAG_PHONE | FLAG_EMAIL | CNT_FAM_MEMBERS | REGION_RATING_CLIENT | REGION_RATING_CLIENT_W_CITY | HOUR_APPR_PROCESS_START | REG_REGION_NOT_LIVE_REGION | REG_REGION_NOT_WORK_REGION | LIVE_REGION_NOT_WORK_REGION | REG_CITY_NOT_LIVE_CITY | REG_CITY_NOT_WORK_CITY | LIVE_CITY_NOT_WORK_CITY | EXT_SOURCE_1 | EXT_SOURCE_2 | EXT_SOURCE_3 | APARTMENTS_AVG | BASEMENTAREA_AVG | YEARS_BEGINEXPLUATATION_AVG | YEARS_BUILD_AVG | COMMONAREA_AVG | ELEVATORS_AVG | ENTRANCES_AVG | FLOORSMAX_AVG | FLOORSMIN_AVG | LANDAREA_AVG | LIVINGAPARTMENTS_AVG | LIVINGAREA_AVG | NONLIVINGAPARTMENTS_AVG | NONLIVINGAREA_AVG | APARTMENTS_MODE | BASEMENTAREA_MODE | YEARS_BEGINEXPLUATATION_MODE | YEARS_BUILD_MODE | COMMONAREA_MODE | ELEVATORS_MODE | ENTRANCES_MODE | FLOORSMAX_MODE | FLOORSMIN_MODE | LANDAREA_MODE | LIVINGAPARTMENTS_MODE | LIVINGAREA_MODE | NONLIVINGAPARTMENTS_MODE | NONLIVINGAREA_MODE | APARTMENTS_MEDI | BASEMENTAREA_MEDI | YEARS_BEGINEXPLUATATION_MEDI | YEARS_BUILD_MEDI | COMMONAREA_MEDI | ELEVATORS_MEDI | ENTRANCES_MEDI | FLOORSMAX_MEDI | FLOORSMIN_MEDI | LANDAREA_MEDI | LIVINGAPARTMENTS_MEDI | LIVINGAREA_MEDI | NONLIVINGAPARTMENTS_MEDI | NONLIVINGAREA_MEDI | TOTALAREA_MODE | OBS_30_CNT_SOCIAL_CIRCLE | DEF_30_CNT_SOCIAL_CIRCLE | OBS_60_CNT_SOCIAL_CIRCLE | DEF_60_CNT_SOCIAL_CIRCLE | DAYS_LAST_PHONE_CHANGE | FLAG_DOCUMENT_2 | FLAG_DOCUMENT_3 | FLAG_DOCUMENT_4 | FLAG_DOCUMENT_5 | FLAG_DOCUMENT_6 | FLAG_DOCUMENT_7 | FLAG_DOCUMENT_8 | FLAG_DOCUMENT_9 | FLAG_DOCUMENT_10 | FLAG_DOCUMENT_11 | FLAG_DOCUMENT_12 | FLAG_DOCUMENT_13 | FLAG_DOCUMENT_14 | FLAG_DOCUMENT_15 | FLAG_DOCUMENT_16 | FLAG_DOCUMENT_17 | FLAG_DOCUMENT_18 | FLAG_DOCUMENT_19 | FLAG_DOCUMENT_20 | FLAG_DOCUMENT_21 | AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 307511.00 | 307511.00 | 307511.00 | 3.075110e+05 | 307511.00 | 307499.00 | 307233.00 | 307511.00 | 307511.00 | 307511.00 | 307511.00 | 307511.00 | 104582.00 | 307511.0 | 307511.00 | 307511.0 | 307511.00 | 307511.00 | 307511.00 | 307509.00 | 307511.00 | 307511.00 | 307511.00 | 307511.00 | 307511.00 | 307511.00 | 307511.00 | 307511.00 | 307511.00 | 134133.00 | 306851.00 | 246546.00 | 151450.00 | 127568.00 | 157504.00 | 103023.00 | 92646.00 | 143620.00 | 152683.00 | 154491.00 | 98869.00 | 124921.00 | 97312.00 | 153161.00 | 93997.00 | 137829.00 | 151450.00 | 127568.00 | 157504.00 | 103023.00 | 92646.00 | 143620.00 | 152683.00 | 154491.00 | 98869.00 | 124921.00 | 97312.00 | 153161.00 | 93997.00 | 137829.00 | 151450.00 | 127568.00 | 157504.00 | 103023.00 | 92646.00 | 143620.00 | 152683.00 | 154491.00 | 98869.00 | 124921.00 | 97312.00 | 153161.00 | 93997.00 | 137829.00 | 159080.00 | 306490.00 | 306490.00 | 306490.00 | 306490.00 | 307510.00 | 307511.00 | 307511.00 | 307511.00 | 307511.00 | 307511.00 | 307511.00 | 307511.00 | 307511.00 | 307511.0 | 307511.00 | 307511.0 | 307511.00 | 307511.00 | 307511.00 | 307511.00 | 307511.00 | 307511.00 | 307511.00 | 307511.00 | 307511.00 | 265992.00 | 265992.00 | 265992.00 | 265992.00 | 265992.00 | 265992.00 |
| mean | 278180.52 | 0.08 | 0.42 | 1.687979e+05 | 599026.00 | 27108.57 | 538396.21 | 0.02 | -16037.00 | 63815.05 | -4986.12 | -2994.20 | 12.06 | 1.0 | 0.82 | 0.2 | 1.00 | 0.28 | 0.06 | 2.15 | 2.05 | 2.03 | 12.06 | 0.02 | 0.05 | 0.04 | 0.08 | 0.23 | 0.18 | 0.50 | 0.51 | 0.51 | 0.12 | 0.09 | 0.98 | 0.75 | 0.04 | 0.08 | 0.15 | 0.23 | 0.23 | 0.07 | 0.10 | 0.11 | 0.01 | 0.03 | 0.11 | 0.09 | 0.98 | 0.76 | 0.04 | 0.07 | 0.15 | 0.22 | 0.23 | 0.06 | 0.11 | 0.11 | 0.01 | 0.03 | 0.12 | 0.09 | 0.98 | 0.76 | 0.04 | 0.08 | 0.15 | 0.23 | 0.23 | 0.07 | 0.10 | 0.11 | 0.01 | 0.03 | 0.10 | 1.42 | 0.14 | 1.41 | 0.10 | -962.86 | 0.00 | 0.71 | 0.00 | 0.02 | 0.09 | 0.00 | 0.08 | 0.00 | 0.0 | 0.00 | 0.0 | 0.00 | 0.00 | 0.00 | 0.01 | 0.00 | 0.01 | 0.00 | 0.00 | 0.00 | 0.01 | 0.01 | 0.03 | 0.27 | 0.27 | 1.90 |
| std | 102790.18 | 0.27 | 0.72 | 2.371231e+05 | 402490.78 | 14493.74 | 369446.46 | 0.01 | 4363.99 | 141275.77 | 3522.89 | 1509.45 | 11.94 | 0.0 | 0.38 | 0.4 | 0.04 | 0.45 | 0.23 | 0.91 | 0.51 | 0.50 | 3.27 | 0.12 | 0.22 | 0.20 | 0.27 | 0.42 | 0.38 | 0.21 | 0.19 | 0.19 | 0.11 | 0.08 | 0.06 | 0.11 | 0.08 | 0.13 | 0.10 | 0.14 | 0.16 | 0.08 | 0.09 | 0.11 | 0.05 | 0.07 | 0.11 | 0.08 | 0.06 | 0.11 | 0.07 | 0.13 | 0.10 | 0.14 | 0.16 | 0.08 | 0.10 | 0.11 | 0.05 | 0.07 | 0.11 | 0.08 | 0.06 | 0.11 | 0.08 | 0.13 | 0.10 | 0.15 | 0.16 | 0.08 | 0.09 | 0.11 | 0.05 | 0.07 | 0.11 | 2.40 | 0.45 | 2.38 | 0.36 | 826.81 | 0.01 | 0.45 | 0.01 | 0.12 | 0.28 | 0.01 | 0.27 | 0.06 | 0.0 | 0.06 | 0.0 | 0.06 | 0.05 | 0.03 | 0.10 | 0.02 | 0.09 | 0.02 | 0.02 | 0.02 | 0.08 | 0.11 | 0.20 | 0.92 | 0.79 | 1.87 |
| min | 100002.00 | 0.00 | 0.00 | 2.565000e+04 | 45000.00 | 1615.50 | 40500.00 | 0.00 | -25229.00 | -17912.00 | -24672.00 | -7197.00 | 0.00 | 0.0 | 0.00 | 0.0 | 0.00 | 0.00 | 0.00 | 1.00 | 1.00 | 1.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.01 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | -4292.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.0 | 0.00 | 0.0 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| 25% | 189145.50 | 0.00 | 0.00 | 1.125000e+05 | 270000.00 | 16524.00 | 238500.00 | 0.01 | -19682.00 | -2760.00 | -7479.50 | -4299.00 | 5.00 | 1.0 | 1.00 | 0.0 | 1.00 | 0.00 | 0.00 | 2.00 | 2.00 | 2.00 | 10.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.33 | 0.39 | 0.37 | 0.06 | 0.04 | 0.98 | 0.69 | 0.01 | 0.00 | 0.07 | 0.17 | 0.08 | 0.02 | 0.05 | 0.05 | 0.00 | 0.00 | 0.05 | 0.04 | 0.98 | 0.70 | 0.01 | 0.00 | 0.07 | 0.17 | 0.08 | 0.02 | 0.05 | 0.04 | 0.00 | 0.00 | 0.06 | 0.04 | 0.98 | 0.69 | 0.01 | 0.00 | 0.07 | 0.17 | 0.08 | 0.02 | 0.05 | 0.05 | 0.00 | 0.00 | 0.04 | 0.00 | 0.00 | 0.00 | 0.00 | -1570.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.0 | 0.00 | 0.0 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| 50% | 278202.00 | 0.00 | 0.00 | 1.471500e+05 | 513531.00 | 24903.00 | 450000.00 | 0.02 | -15750.00 | -1213.00 | -4504.00 | -3254.00 | 9.00 | 1.0 | 1.00 | 0.0 | 1.00 | 0.00 | 0.00 | 2.00 | 2.00 | 2.00 | 12.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.51 | 0.57 | 0.54 | 0.09 | 0.08 | 0.98 | 0.76 | 0.02 | 0.00 | 0.14 | 0.17 | 0.21 | 0.05 | 0.08 | 0.07 | 0.00 | 0.00 | 0.08 | 0.07 | 0.98 | 0.76 | 0.02 | 0.00 | 0.14 | 0.17 | 0.21 | 0.05 | 0.08 | 0.07 | 0.00 | 0.00 | 0.09 | 0.08 | 0.98 | 0.76 | 0.02 | 0.00 | 0.14 | 0.17 | 0.21 | 0.05 | 0.08 | 0.07 | 0.00 | 0.00 | 0.07 | 0.00 | 0.00 | 0.00 | 0.00 | -757.00 | 0.00 | 1.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.0 | 0.00 | 0.0 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 |
| 75% | 367142.50 | 0.00 | 1.00 | 2.025000e+05 | 808650.00 | 34596.00 | 679500.00 | 0.03 | -12413.00 | -289.00 | -2010.00 | -1720.00 | 15.00 | 1.0 | 1.00 | 0.0 | 1.00 | 1.00 | 0.00 | 3.00 | 2.00 | 2.00 | 14.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.68 | 0.66 | 0.67 | 0.15 | 0.11 | 0.99 | 0.82 | 0.05 | 0.12 | 0.21 | 0.33 | 0.38 | 0.09 | 0.12 | 0.13 | 0.00 | 0.03 | 0.14 | 0.11 | 0.99 | 0.82 | 0.05 | 0.12 | 0.21 | 0.33 | 0.38 | 0.08 | 0.13 | 0.13 | 0.00 | 0.02 | 0.15 | 0.11 | 0.99 | 0.83 | 0.05 | 0.12 | 0.21 | 0.33 | 0.38 | 0.09 | 0.12 | 0.13 | 0.00 | 0.03 | 0.13 | 2.00 | 0.00 | 2.00 | 0.00 | -274.00 | 0.00 | 1.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.0 | 0.00 | 0.0 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 3.00 |
| max | 456255.00 | 1.00 | 19.00 | 1.170000e+08 | 4050000.00 | 258025.50 | 4050000.00 | 0.07 | -7489.00 | 365243.00 | 0.00 | 0.00 | 91.00 | 1.0 | 1.00 | 1.0 | 1.00 | 1.00 | 1.00 | 20.00 | 3.00 | 3.00 | 23.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 0.96 | 0.85 | 0.90 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 348.00 | 34.00 | 344.00 | 24.00 | 0.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.0 | 1.00 | 1.0 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 4.00 | 9.00 | 8.00 | 27.00 | 261.00 | 25.00 |
None
display_feature_info(datasets['application_test'], 'application_test')
Description of the df continued for application_test:
---------------------------------------------------------------------------
Data type value counts:
float64 65
int64 40
object 16
dtype: int64
Return number of unique elements in the object.
NAME_CONTRACT_TYPE 2
CODE_GENDER 2
FLAG_OWN_CAR 2
FLAG_OWN_REALTY 2
NAME_TYPE_SUITE 7
NAME_INCOME_TYPE 7
NAME_EDUCATION_TYPE 5
NAME_FAMILY_STATUS 5
NAME_HOUSING_TYPE 6
OCCUPATION_TYPE 18
WEEKDAY_APPR_PROCESS_START 7
ORGANIZATION_TYPE 58
FONDKAPREMONT_MODE 4
HOUSETYPE_MODE 3
WALLSMATERIAL_MODE 7
EMERGENCYSTATE_MODE 2
dtype: int64
---------------------------------------------------------------------------
Categorical and Numerical(int + float) features of application_test.
---------------------------------------------------------------------------
{'int64': Index(['SK_ID_CURR', 'CNT_CHILDREN', 'DAYS_BIRTH', 'DAYS_EMPLOYED',
'DAYS_ID_PUBLISH', 'FLAG_MOBIL', 'FLAG_EMP_PHONE', 'FLAG_WORK_PHONE',
'FLAG_CONT_MOBILE', 'FLAG_PHONE', 'FLAG_EMAIL', 'REGION_RATING_CLIENT',
'REGION_RATING_CLIENT_W_CITY', 'HOUR_APPR_PROCESS_START',
'REG_REGION_NOT_LIVE_REGION', 'REG_REGION_NOT_WORK_REGION',
'LIVE_REGION_NOT_WORK_REGION', 'REG_CITY_NOT_LIVE_CITY',
'REG_CITY_NOT_WORK_CITY', 'LIVE_CITY_NOT_WORK_CITY', 'FLAG_DOCUMENT_2',
'FLAG_DOCUMENT_3', 'FLAG_DOCUMENT_4', 'FLAG_DOCUMENT_5',
'FLAG_DOCUMENT_6', 'FLAG_DOCUMENT_7', 'FLAG_DOCUMENT_8',
'FLAG_DOCUMENT_9', 'FLAG_DOCUMENT_10', 'FLAG_DOCUMENT_11',
'FLAG_DOCUMENT_12', 'FLAG_DOCUMENT_13', 'FLAG_DOCUMENT_14',
'FLAG_DOCUMENT_15', 'FLAG_DOCUMENT_16', 'FLAG_DOCUMENT_17',
'FLAG_DOCUMENT_18', 'FLAG_DOCUMENT_19', 'FLAG_DOCUMENT_20',
'FLAG_DOCUMENT_21'],
dtype='object')}
------------------------------
{'float64': Index(['AMT_INCOME_TOTAL', 'AMT_CREDIT', 'AMT_ANNUITY', 'AMT_GOODS_PRICE',
'REGION_POPULATION_RELATIVE', 'DAYS_REGISTRATION', 'OWN_CAR_AGE',
'CNT_FAM_MEMBERS', 'EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3',
'APARTMENTS_AVG', 'BASEMENTAREA_AVG', 'YEARS_BEGINEXPLUATATION_AVG',
'YEARS_BUILD_AVG', 'COMMONAREA_AVG', 'ELEVATORS_AVG', 'ENTRANCES_AVG',
'FLOORSMAX_AVG', 'FLOORSMIN_AVG', 'LANDAREA_AVG',
'LIVINGAPARTMENTS_AVG', 'LIVINGAREA_AVG', 'NONLIVINGAPARTMENTS_AVG',
'NONLIVINGAREA_AVG', 'APARTMENTS_MODE', 'BASEMENTAREA_MODE',
'YEARS_BEGINEXPLUATATION_MODE', 'YEARS_BUILD_MODE', 'COMMONAREA_MODE',
'ELEVATORS_MODE', 'ENTRANCES_MODE', 'FLOORSMAX_MODE', 'FLOORSMIN_MODE',
'LANDAREA_MODE', 'LIVINGAPARTMENTS_MODE', 'LIVINGAREA_MODE',
'NONLIVINGAPARTMENTS_MODE', 'NONLIVINGAREA_MODE', 'APARTMENTS_MEDI',
'BASEMENTAREA_MEDI', 'YEARS_BEGINEXPLUATATION_MEDI', 'YEARS_BUILD_MEDI',
'COMMONAREA_MEDI', 'ELEVATORS_MEDI', 'ENTRANCES_MEDI', 'FLOORSMAX_MEDI',
'FLOORSMIN_MEDI', 'LANDAREA_MEDI', 'LIVINGAPARTMENTS_MEDI',
'LIVINGAREA_MEDI', 'NONLIVINGAPARTMENTS_MEDI', 'NONLIVINGAREA_MEDI',
'TOTALAREA_MODE', 'OBS_30_CNT_SOCIAL_CIRCLE',
'DEF_30_CNT_SOCIAL_CIRCLE', 'OBS_60_CNT_SOCIAL_CIRCLE',
'DEF_60_CNT_SOCIAL_CIRCLE', 'DAYS_LAST_PHONE_CHANGE',
'AMT_REQ_CREDIT_BUREAU_HOUR', 'AMT_REQ_CREDIT_BUREAU_DAY',
'AMT_REQ_CREDIT_BUREAU_WEEK', 'AMT_REQ_CREDIT_BUREAU_MON',
'AMT_REQ_CREDIT_BUREAU_QRT', 'AMT_REQ_CREDIT_BUREAU_YEAR'],
dtype='object')}
------------------------------
{'object': Index(['NAME_CONTRACT_TYPE', 'CODE_GENDER', 'FLAG_OWN_CAR', 'FLAG_OWN_REALTY',
'NAME_TYPE_SUITE', 'NAME_INCOME_TYPE', 'NAME_EDUCATION_TYPE',
'NAME_FAMILY_STATUS', 'NAME_HOUSING_TYPE', 'OCCUPATION_TYPE',
'WEEKDAY_APPR_PROCESS_START', 'ORGANIZATION_TYPE', 'FONDKAPREMONT_MODE',
'HOUSETYPE_MODE', 'WALLSMATERIAL_MODE', 'EMERGENCYSTATE_MODE'],
dtype='object')}
------------------------------
---------------------------------------------------------------------------
---------------------------------------------------------------------------
The Missing Data:
| | Percent | Train Missing Count |
|---|---|---|
| COMMONAREA_AVG | 68.72 | 33495 |
| COMMONAREA_MODE | 68.72 | 33495 |
| COMMONAREA_MEDI | 68.72 | 33495 |
| NONLIVINGAPARTMENTS_AVG | 68.41 | 33347 |
| NONLIVINGAPARTMENTS_MODE | 68.41 | 33347 |
| NONLIVINGAPARTMENTS_MEDI | 68.41 | 33347 |
| FONDKAPREMONT_MODE | 67.28 | 32797 |
| LIVINGAPARTMENTS_AVG | 67.25 | 32780 |
| LIVINGAPARTMENTS_MODE | 67.25 | 32780 |
| LIVINGAPARTMENTS_MEDI | 67.25 | 32780 |
| FLOORSMIN_MEDI | 66.61 | 32466 |
| FLOORSMIN_AVG | 66.61 | 32466 |
| FLOORSMIN_MODE | 66.61 | 32466 |
| OWN_CAR_AGE | 66.29 | 32312 |
| YEARS_BUILD_AVG | 65.28 | 31818 |
| YEARS_BUILD_MEDI | 65.28 | 31818 |
| YEARS_BUILD_MODE | 65.28 | 31818 |
| LANDAREA_MEDI | 57.96 | 28254 |
| LANDAREA_AVG | 57.96 | 28254 |
| LANDAREA_MODE | 57.96 | 28254 |
| BASEMENTAREA_MEDI | 56.71 | 27641 |
| BASEMENTAREA_AVG | 56.71 | 27641 |
| BASEMENTAREA_MODE | 56.71 | 27641 |
| NONLIVINGAREA_AVG | 53.51 | 26084 |
| NONLIVINGAREA_MODE | 53.51 | 26084 |
| NONLIVINGAREA_MEDI | 53.51 | 26084 |
| ELEVATORS_MODE | 51.68 | 25189 |
| ELEVATORS_MEDI | 51.68 | 25189 |
| ELEVATORS_AVG | 51.68 | 25189 |
| WALLSMATERIAL_MODE | 49.02 | 23893 |
| APARTMENTS_MODE | 49.01 | 23887 |
| APARTMENTS_MEDI | 49.01 | 23887 |
| APARTMENTS_AVG | 49.01 | 23887 |
| HOUSETYPE_MODE | 48.46 | 23619 |
| ENTRANCES_MODE | 48.37 | 23579 |
| ENTRANCES_AVG | 48.37 | 23579 |
| ENTRANCES_MEDI | 48.37 | 23579 |
| LIVINGAREA_MEDI | 48.32 | 23552 |
| LIVINGAREA_MODE | 48.32 | 23552 |
| LIVINGAREA_AVG | 48.32 | 23552 |
| FLOORSMAX_AVG | 47.84 | 23321 |
| FLOORSMAX_MEDI | 47.84 | 23321 |
| FLOORSMAX_MODE | 47.84 | 23321 |
| YEARS_BEGINEXPLUATATION_AVG | 46.89 | 22856 |
| YEARS_BEGINEXPLUATATION_MEDI | 46.89 | 22856 |
| YEARS_BEGINEXPLUATATION_MODE | 46.89 | 22856 |
| TOTALAREA_MODE | 46.41 | 22624 |
| EMERGENCYSTATE_MODE | 45.56 | 22209 |
| EXT_SOURCE_1 | 42.12 | 20532 |
| OCCUPATION_TYPE | 32.01 | 15605 |
| EXT_SOURCE_3 | 17.78 | 8668 |
| AMT_REQ_CREDIT_BUREAU_DAY | 12.41 | 6049 |
| AMT_REQ_CREDIT_BUREAU_WEEK | 12.41 | 6049 |
| AMT_REQ_CREDIT_BUREAU_HOUR | 12.41 | 6049 |
| AMT_REQ_CREDIT_BUREAU_MON | 12.41 | 6049 |
| AMT_REQ_CREDIT_BUREAU_QRT | 12.41 | 6049 |
| AMT_REQ_CREDIT_BUREAU_YEAR | 12.41 | 6049 |
| NAME_TYPE_SUITE | 1.87 | 911 |
| DEF_30_CNT_SOCIAL_CIRCLE | 0.06 | 29 |
| OBS_30_CNT_SOCIAL_CIRCLE | 0.06 | 29 |
| OBS_60_CNT_SOCIAL_CIRCLE | 0.06 | 29 |
| DEF_60_CNT_SOCIAL_CIRCLE | 0.06 | 29 |
| AMT_ANNUITY | 0.05 | 24 |
| EXT_SOURCE_2 | 0.02 | 8 |
---------------------------------------------------------------------------
In the HCDR competition (as in many other machine learning problems that involve multiple tables, whether in 3NF or not), we need to join (denormalize) these datasets before using them in a machine learning pipeline. Joining the secondary tables with the primary table yields many new features for each loan application; these tend to be aggregate features or metadata about the loan or its application. How can we do this when using machine learning pipelines?
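previous_application with application_x
We refer to the application_train data (and likewise the application_test data) as the primary table and to the other files as the secondary tables (e.g., the previous_application dataset). The previous_application table joins to the primary table on the key SK_ID_CURR, which identifies the current loan application; SK_ID_PREV identifies an individual previous application and links previous_application to tables such as POS_CASH_balance.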
Let's assume we wish to generate features based on previous application attempts. Possible features could be based on the average, min, max, median, etc. of AMT_APPLICATION and AMT_CREDIT.
To build such features, we need to join the application_train data (and likewise the application_test data) with the previous_application dataset (and the other available datasets).
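The aggregate features described above can be sketched with a pandas groupby. This is a minimal illustration on a made-up miniature table, not the notebook's actual processing; the values and the `PREV_APP_COUNT` feature name are assumptions introduced here.

```python
import pandas as pd

# Toy stand-in for previous_application; the values are made up for illustration.
prev = pd.DataFrame({
    "SK_ID_PREV": [1, 2, 3, 4, 5],
    "SK_ID_CURR": [100, 100, 100, 200, 200],
    "AMT_APPLICATION": [1000.0, 3000.0, 2000.0, 500.0, 1500.0],
    "AMT_CREDIT": [900.0, 3100.0, 2000.0, 500.0, None],
})

agg_ops = ["min", "max", "mean", "median"]

# One row per current application, one column per (feature, statistic) pair.
prev_features = prev.groupby("SK_ID_CURR").agg(
    {"AMT_APPLICATION": agg_ops, "AMT_CREDIT": agg_ops}
)
# Flatten the two-level column index, e.g. ("AMT_CREDIT", "mean") -> "AMT_CREDIT_mean".
prev_features.columns = ["_".join(col) for col in prev_features.columns]

# Hypothetical extra feature: number of previous applications per client.
prev_features["PREV_APP_COUNT"] = prev.groupby("SK_ID_CURR")["SK_ID_PREV"].count()

# Turn the group key back into a regular column so it can serve as a join key.
prev_features = prev_features.reset_index()
```

Grouping with the key as the index and calling `reset_index()` only after flattening keeps the key column named `SK_ID_CURR`, which is what a later merge with the primary table needs.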
When joining this data in the context of pipelines, different strategies come to mind with various tradeoffs:
Join the secondary tables with the application_train data (the labeled dataset) and with the application_test data (the unlabeled submission dataset) prior to processing the data (in a train, valid, test partition) via your machine learning pipeline. [This approach is recommended for this HCDR competition. WHY?]
I want you to think about this section and build on this.
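Joining before the pipeline might look like the sketch below. It assumes per-client features have already been computed from a secondary table; the miniature tables and the helper name `add_secondary_features` are hypothetical. One plausible reason this order works here is that such aggregates use only each applicant's own secondary rows and never the TARGET column, so computing them before the train/valid/test split does not involve the label.

```python
import pandas as pd

# Hypothetical miniature tables; only SK_ID_CURR and TARGET are real HCDR column names here.
app_train = pd.DataFrame({"SK_ID_CURR": [100, 200, 300], "TARGET": [0, 1, 0]})
app_test = pd.DataFrame({"SK_ID_CURR": [400, 500]})
prev_features = pd.DataFrame({
    "SK_ID_CURR": [100, 200, 400],
    "AMT_APPLICATION_mean": [2000.0, 1000.0, 750.0],
})


def add_secondary_features(primary, features, key="SK_ID_CURR"):
    """Left-join precomputed per-client features onto a primary table."""
    # validate= raises if the feature table has duplicate keys,
    # which would otherwise silently duplicate loan applications.
    return primary.merge(features, on=key, how="left", validate="many_to_one")


# Apply the same join to the labeled and unlabeled tables so both get the same columns.
train_joined = add_secondary_features(app_train, prev_features)
test_joined = add_secondary_features(app_test, prev_features)

# The joined training table is then split into features and label as usual.
X = train_joined.drop(columns=["TARGET"])
y = train_joined["TARGET"]
```

A left join keeps every loan application; applicants with no previous applications get NaN in the new columns, which an imputation step in the pipeline would then need to handle.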
Join the secondary tables with the primary tables (i.e., with the application_train data (the labeled dataset) and with the application_test data (the unlabeled submission dataset)), thereby leading to X_train, y_train, X_valid, etc.
appsDF[0:50][(appsDF["SK_ID_CURR"]==175704)]
| | SK_ID_PREV | SK_ID_CURR | NAME_CONTRACT_TYPE | AMT_ANNUITY | AMT_APPLICATION | AMT_CREDIT | AMT_DOWN_PAYMENT | AMT_GOODS_PRICE | WEEKDAY_APPR_PROCESS_START | HOUR_APPR_PROCESS_START | FLAG_LAST_APPL_PER_CONTRACT | NFLAG_LAST_APPL_IN_DAY | RATE_DOWN_PAYMENT | RATE_INTEREST_PRIMARY | RATE_INTEREST_PRIVILEGED | NAME_CASH_LOAN_PURPOSE | NAME_CONTRACT_STATUS | DAYS_DECISION | NAME_PAYMENT_TYPE | CODE_REJECT_REASON | NAME_TYPE_SUITE | NAME_CLIENT_TYPE | NAME_GOODS_CATEGORY | NAME_PORTFOLIO | NAME_PRODUCT_TYPE | CHANNEL_TYPE | SELLERPLACE_AREA | NAME_SELLER_INDUSTRY | CNT_PAYMENT | NAME_YIELD_GROUP | PRODUCT_COMBINATION | DAYS_FIRST_DRAWING | DAYS_FIRST_DUE | DAYS_LAST_DUE_1ST_VERSION | DAYS_LAST_DUE | DAYS_TERMINATION | NFLAG_INSURED_ON_APPROVAL |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 6 | 2315218 | 175704 | Cash loans | NaN | 0.0 | 0.0 | NaN | NaN | TUESDAY | 11 | Y | 1 | NaN | NaN | NaN | XNA | Canceled | -14 | XNA | XAP | NaN | Repeater | XNA | XNA | XNA | Credit and cash offices | -1 | XNA | NaN | XNA | Cash | NaN | NaN | NaN | NaN | NaN | NaN |
appsDF[0:50][(appsDF["SK_ID_CURR"]==175704) & ~(appsDF["AMT_CREDIT"]==1.0)]
| | SK_ID_PREV | SK_ID_CURR | NAME_CONTRACT_TYPE | AMT_ANNUITY | AMT_APPLICATION | AMT_CREDIT | AMT_DOWN_PAYMENT | AMT_GOODS_PRICE | WEEKDAY_APPR_PROCESS_START | HOUR_APPR_PROCESS_START | FLAG_LAST_APPL_PER_CONTRACT | NFLAG_LAST_APPL_IN_DAY | RATE_DOWN_PAYMENT | RATE_INTEREST_PRIMARY | RATE_INTEREST_PRIVILEGED | NAME_CASH_LOAN_PURPOSE | NAME_CONTRACT_STATUS | DAYS_DECISION | NAME_PAYMENT_TYPE | CODE_REJECT_REASON | NAME_TYPE_SUITE | NAME_CLIENT_TYPE | NAME_GOODS_CATEGORY | NAME_PORTFOLIO | NAME_PRODUCT_TYPE | CHANNEL_TYPE | SELLERPLACE_AREA | NAME_SELLER_INDUSTRY | CNT_PAYMENT | NAME_YIELD_GROUP | PRODUCT_COMBINATION | DAYS_FIRST_DRAWING | DAYS_FIRST_DUE | DAYS_LAST_DUE_1ST_VERSION | DAYS_LAST_DUE | DAYS_TERMINATION | NFLAG_INSURED_ON_APPROVAL |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 6 | 2315218 | 175704 | Cash loans | NaN | 0.0 | 0.0 | NaN | NaN | TUESDAY | 11 | Y | 1 | NaN | NaN | NaN | XNA | Canceled | -14 | XNA | XAP | NaN | Repeater | XNA | XNA | XNA | Credit and cash offices | -1 | XNA | NaN | XNA | Cash | NaN | NaN | NaN | NaN | NaN | NaN |
appsDF.isna().sum()
SK_ID_PREV 0
SK_ID_CURR 0
NAME_CONTRACT_TYPE 0
AMT_ANNUITY 372235
AMT_APPLICATION 0
AMT_CREDIT 1
AMT_DOWN_PAYMENT 895844
AMT_GOODS_PRICE 385515
WEEKDAY_APPR_PROCESS_START 0
HOUR_APPR_PROCESS_START 0
FLAG_LAST_APPL_PER_CONTRACT 0
NFLAG_LAST_APPL_IN_DAY 0
RATE_DOWN_PAYMENT 895844
RATE_INTEREST_PRIMARY 1664263
RATE_INTEREST_PRIVILEGED 1664263
NAME_CASH_LOAN_PURPOSE 0
NAME_CONTRACT_STATUS 0
DAYS_DECISION 0
NAME_PAYMENT_TYPE 0
CODE_REJECT_REASON 0
NAME_TYPE_SUITE 820405
NAME_CLIENT_TYPE 0
NAME_GOODS_CATEGORY 0
NAME_PORTFOLIO 0
NAME_PRODUCT_TYPE 0
CHANNEL_TYPE 0
SELLERPLACE_AREA 0
NAME_SELLER_INDUSTRY 0
CNT_PAYMENT 372230
NAME_YIELD_GROUP 0
PRODUCT_COMBINATION 346
DAYS_FIRST_DRAWING 673065
DAYS_FIRST_DUE 673065
DAYS_LAST_DUE_1ST_VERSION 673065
DAYS_LAST_DUE 673065
DAYS_TERMINATION 673065
NFLAG_INSURED_ON_APPROVAL 673065
dtype: int64
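Raw missing-value counts like these are easier to compare as percentages. The helper below is a rough sketch of one way to build such a table; `missing_summary` and the demo frame are hypothetical and are not the notebook's own `display_feature_info`.

```python
import pandas as pd


def missing_summary(df):
    """Return percent and count of missing values per column, worst first."""
    counts = df.isna().sum()
    percent = (100 * counts / len(df)).round(2)
    summary = pd.DataFrame({"Percent": percent, "Missing Count": counts})
    # Keep only the columns that actually have gaps.
    return summary[summary["Missing Count"] > 0].sort_values("Percent", ascending=False)


# Made-up demo data, reusing a few previous_application column names.
demo = pd.DataFrame({
    "AMT_ANNUITY": [1.0, None, 3.0, None],
    "AMT_CREDIT": [1.0, 2.0, 3.0, None],
    "SK_ID_PREV": [1, 2, 3, 4],
})
summary = missing_summary(demo)
```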
appsDF.columns
Index(['SK_ID_PREV', 'SK_ID_CURR', 'NAME_CONTRACT_TYPE', 'AMT_ANNUITY',
'AMT_APPLICATION', 'AMT_CREDIT', 'AMT_DOWN_PAYMENT', 'AMT_GOODS_PRICE',
'WEEKDAY_APPR_PROCESS_START', 'HOUR_APPR_PROCESS_START',
'FLAG_LAST_APPL_PER_CONTRACT', 'NFLAG_LAST_APPL_IN_DAY',
'RATE_DOWN_PAYMENT', 'RATE_INTEREST_PRIMARY',
'RATE_INTEREST_PRIVILEGED', 'NAME_CASH_LOAN_PURPOSE',
'NAME_CONTRACT_STATUS', 'DAYS_DECISION', 'NAME_PAYMENT_TYPE',
'CODE_REJECT_REASON', 'NAME_TYPE_SUITE', 'NAME_CLIENT_TYPE',
'NAME_GOODS_CATEGORY', 'NAME_PORTFOLIO', 'NAME_PRODUCT_TYPE',
'CHANNEL_TYPE', 'SELLERPLACE_AREA', 'NAME_SELLER_INDUSTRY',
'CNT_PAYMENT', 'NAME_YIELD_GROUP', 'PRODUCT_COMBINATION',
'DAYS_FIRST_DRAWING', 'DAYS_FIRST_DUE', 'DAYS_LAST_DUE_1ST_VERSION',
'DAYS_LAST_DUE', 'DAYS_TERMINATION', 'NFLAG_INSURED_ON_APPROVAL'],
dtype='object')
# features = ['AMT_ANNUITY', 'AMT_APPLICATION']
# print(f"{appsDF[features].describe()}")
# agg_ops = ["min", "max", "mean"]
# result = appsDF.groupby(["SK_ID_CURR"], as_index=False).agg("mean") #group by ID
# display(result.head())
# print("-"*50)
# result = appsDF.groupby(["SK_ID_CURR"], as_index=False).agg({'AMT_ANNUITY' : agg_ops, 'AMT_APPLICATION' : agg_ops})
# result.columns = result.columns.map('_'.join)
# display(result)
# result['range_AMT_APPLICATION'] = result['AMT_APPLICATION_max'] - result['AMT_APPLICATION_min']
# print(f"result.shape: {result.shape}")
# result[0:10]
# result.isna().sum()
# # Create aggregate features (via pipeline)
# class prevAppsFeaturesAggregater(BaseEstimator, TransformerMixin):
# def __init__(self, features=None): # no *args or **kargs
# self.features = features
# self.agg_op_features = ["min", "max", "mean"]
# def fit(self, X, y=None):
# return self
# def transform(self, X, y=None):
# #from IPython.core.debugger import Pdb as pdb; pdb().set_trace() #breakpoint; dont forget to quit
# result = X.groupby(["SK_ID_CURR"], as_index=False).agg("mean") #group by ID
# result = appsDF.groupby(["SK_ID_CURR"], as_index=False).agg({'AMT_ANNUITY' : self.agg_op_features, 'AMT_APPLICATION' : self.agg_op_features})
# result.columns = result.columns.map('_'.join)
# #display(result)
# #result = result.reset_index(level=["SK_ID_CURR"])
# result['range_AMT_APPLICATION'] = result['AMT_APPLICATION_max'] - result['AMT_APPLICATION_min']
# return result # return dataframe with the join key "SK_ID_CURR"
# # result[0:10]
# from sklearn.pipeline import make_pipeline
# def test_driver_prevAppsFeaturesAggregater(df, features):
# print(f"df.shape: {df.shape}\n")
# print(f"df[{features}][0:5]: \n{df[features][0:5]}")
# test_pipeline = make_pipeline(prevAppsFeaturesAggregater(features))
# return(test_pipeline.fit_transform(df))
# features = ['AMT_ANNUITY', 'AMT_APPLICATION']
# features = ['AMT_ANNUITY',
# 'AMT_APPLICATION', 'AMT_CREDIT', 'AMT_DOWN_PAYMENT', 'AMT_GOODS_PRICE',
# 'RATE_DOWN_PAYMENT', 'RATE_INTEREST_PRIMARY',
# 'RATE_INTEREST_PRIVILEGED', 'DAYS_DECISION', 'NAME_PAYMENT_TYPE',
# 'CNT_PAYMENT',
# 'DAYS_FIRST_DRAWING', 'DAYS_FIRST_DUE', 'DAYS_LAST_DUE_1ST_VERSION',
# 'DAYS_LAST_DUE', 'DAYS_TERMINATION']
# features = ['AMT_ANNUITY', 'AMT_APPLICATION']
# res = test_driver_prevAppsFeaturesAggregater(appsDF, features)
# print(f"HELLO")
# print(f"Test driver: \n{res[0:10]}")
# print(f"input[features][0:10]: \n{appsDF[0:10]}")
# # QUESTION, should we lower case df['OCCUPATION_TYPE'] as Sales staff != 'Sales Staff'? (hint: YES)
Combine each secondary dataset with the target variable of the primary dataset (application_train) to examine the correlation between the target and the secondary dataset's features.
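def correlation_with_target(df):
    # df is the key of a secondary dataset in the `datasets` dict
    app_train = datasets["application_train"].copy()
    second_df = datasets[df].copy()
    # NOTE: pd.concat(axis=1) aligns rows by index label, not by SK_ID_CURR
    corr_matrix = pd.concat([app_train.TARGET, second_df], axis=1).corr().filter(second_df.columns).filter(app_train.columns, axis=0)
    return corr_matrix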
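Because `pd.concat(..., axis=1)` pairs rows by index label, each TARGET value is matched with whichever secondary row shares its row number rather than with the same applicant. A minimal sketch of a key-aligned alternative follows, assuming the secondary table carries `SK_ID_CURR` (`bureau_balance` does not). The function name `correlation_with_target_by_key` and the toy frames are hypothetical and not part of the notebook.

```python
import pandas as pd

def correlation_with_target_by_key(app_train, second_df, key="SK_ID_CURR"):
    """Correlate numeric columns of second_df with TARGET after joining on key."""
    # Attach each applicant's TARGET to every matching secondary row.
    merged = second_df.merge(app_train[[key, "TARGET"]], on=key, how="inner")
    numeric = merged.select_dtypes(include="number")
    return numeric.corr()["TARGET"].drop("TARGET").sort_values(ascending=False)

# Toy data (hypothetical values) to show the mechanics.
toy_app = pd.DataFrame({"SK_ID_CURR": [1, 2, 3, 4], "TARGET": [0, 1, 0, 1]})
toy_second = pd.DataFrame({
    "SK_ID_CURR": [4, 3, 2, 1, 1],
    "AMT_X": [10.0, 1.0, 9.0, 2.0, 0.0],
})
toy_corr = correlation_with_target_by_key(toy_app, toy_second)
print(toy_corr)
```

With real data the call would look like `correlation_with_target_by_key(datasets["application_train"], datasets["bureau"])`; the resulting values would likely differ from the index-aligned ones shown in this section.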
The following secondary datasets will be explored for correlation against the target variable.
for dataset in datasets.keys():
    print(dataset)
application_train application_test bureau bureau_balance credit_card_balance installments_payments previous_application POS_CASH_balance
df_name = "bureau"
correlation_matrix = correlation_with_target(df_name)
print(f"Correlation of the {df_name} against the Target is :")
correlation_matrix.T.TARGET.sort_values(ascending= False)
Correlation of the bureau against the Target is :
DAYS_CREDIT_UPDATE 0.002159 DAYS_CREDIT_ENDDATE 0.002048 SK_ID_BUREAU 0.001550 DAYS_CREDIT 0.001443 AMT_CREDIT_SUM 0.000218 DAYS_ENDDATE_FACT 0.000203 AMT_ANNUITY 0.000189 AMT_CREDIT_MAX_OVERDUE -0.000389 CNT_CREDIT_PROLONG -0.000495 AMT_CREDIT_SUM_LIMIT -0.000558 AMT_CREDIT_SUM_DEBT -0.000946 SK_ID_CURR -0.001070 AMT_CREDIT_SUM_OVERDUE -0.001464 CREDIT_DAY_OVERDUE -0.001815 Name: TARGET, dtype: float64
Important features from Phase 1: 'AMT_ANNUITY', 'AMT_CREDIT_SUM','DAYS_CREDIT','AMT_CREDIT_SUM_OVERDUE','CREDIT_DAY_OVERDUE'
Important features from Phase 2: 'AMT_CREDIT_SUM','AMT_CREDIT_SUM_DEBT','AMT_CREDIT_SUM_LIMIT','AMT_CREDIT_MAX_OVERDUE'
df_name = "bureau_balance"
correlation_matrix = correlation_with_target(df_name)
print(f"Correlation of the {df_name} against the Target is :")
correlation_matrix.T.TARGET.sort_values(ascending= False)
Correlation of the bureau_balance against the Target is :
SK_ID_BUREAU 0.001223 MONTHS_BALANCE -0.005262 Name: TARGET, dtype: float64
df_name = "credit_card_balance"
correlation_matrix = correlation_with_target(df_name)
print(f"Correlation of the {df_name} against the Target is :")
correlation_matrix.T.TARGET.sort_values(ascending= False)
Correlation of the credit_card_balance against the Target is :
CNT_DRAWINGS_ATM_CURRENT 0.001908 AMT_DRAWINGS_ATM_CURRENT 0.001520 AMT_INST_MIN_REGULARITY 0.001435 SK_ID_CURR 0.001086 AMT_CREDIT_LIMIT_ACTUAL 0.000515 AMT_BALANCE 0.000448 SK_ID_PREV 0.000446 AMT_RECIVABLE 0.000412 AMT_TOTAL_RECEIVABLE 0.000407 AMT_RECEIVABLE_PRINCIPAL 0.000383 SK_DPD 0.000092 SK_DPD_DEF -0.000201 CNT_INSTALMENT_MATURE_CUM -0.000342 MONTHS_BALANCE -0.000768 AMT_PAYMENT_CURRENT -0.001129 AMT_PAYMENT_TOTAL_CURRENT -0.001395 AMT_DRAWINGS_CURRENT -0.001419 CNT_DRAWINGS_CURRENT -0.001764 CNT_DRAWINGS_OTHER_CURRENT -0.001833 CNT_DRAWINGS_POS_CURRENT -0.002387 AMT_DRAWINGS_OTHER_CURRENT -0.002672 AMT_DRAWINGS_POS_CURRENT -0.003518 Name: TARGET, dtype: float64
Important features from Phase 1: 'MONTHS_BALANCE', 'AMT_BALANCE', 'CNT_INSTALMENT_MATURE_CUM','AMT_DRAWINGS_ATM_CURRENT' ,'AMT_INST_MIN_REGULARITY','AMT_PAYMENT_TOTAL_CURRENT'
Important features from Phase 2: 'CNT_DRAWINGS_ATM_CURRENT','AMT_CREDIT_LIMIT_ACTUAL','AMT_RECIVABLE', 'AMT_TOTAL_RECEIVABLE','AMT_RECEIVABLE_PRINCIPAL'
df_name = "installments_payments"
correlation_matrix = correlation_with_target(df_name)
print(f"Correlation of the {df_name} against the Target is :")
correlation_matrix.T.TARGET.sort_values(ascending= False)
Correlation of the installments_payments against the Target is :
SK_ID_PREV 0.002891 NUM_INSTALMENT_VERSION 0.002511 NUM_INSTALMENT_NUMBER 0.000626 SK_ID_CURR -0.000781 AMT_PAYMENT -0.003512 DAYS_INSTALMENT -0.003955 AMT_INSTALMENT -0.003972 DAYS_ENTRY_PAYMENT -0.004046 Name: TARGET, dtype: float64
Important features from Phase 1: 'AMT_INSTALMENT', 'AMT_PAYMENT'
Important features from Phase 2: 'DAYS_ENTRY_PAYMENT','DAYS_INSTALMENT','NUM_INSTALMENT_VERSION'
df_name = "previous_application"
correlation_matrix = correlation_with_target(df_name)
print(f"Correlation of the {df_name} against the Target is :")
correlation_matrix.T.TARGET.sort_values(ascending= False)
Correlation of the previous_application against the Target is :
AMT_DOWN_PAYMENT 0.002496 CNT_PAYMENT 0.002341 DAYS_LAST_DUE_1ST_VERSION 0.001908 AMT_CREDIT 0.001833 AMT_APPLICATION 0.001689 AMT_GOODS_PRICE 0.001676 SK_ID_CURR 0.001107 NFLAG_INSURED_ON_APPROVAL 0.000879 RATE_DOWN_PAYMENT 0.000850 RATE_INTEREST_PRIMARY 0.000542 SK_ID_PREV 0.000362 DAYS_DECISION -0.000482 AMT_ANNUITY -0.000492 DAYS_FIRST_DUE -0.000943 SELLERPLACE_AREA -0.000954 DAYS_TERMINATION -0.001072 NFLAG_LAST_APPL_IN_DAY -0.001256 DAYS_FIRST_DRAWING -0.001293 DAYS_LAST_DUE -0.001940 HOUR_APPR_PROCESS_START -0.002285 RATE_INTEREST_PRIVILEGED -0.026427 Name: TARGET, dtype: float64
Important features from Phase 1: 'AMT_ANNUITY', 'AMT_APPLICATION','AMT_DOWN_PAYMENT','CNT_PAYMENT','RATE_INTEREST_PRIVILEGED'
Important features from Phase 2: 'AMT_CREDIT','DAYS_FIRST_DRAWING','DAYS_LAST_DUE','HOUR_APPR_PROCESS_START','DAYS_FIRST_DUE'
df_name = "POS_CASH_balance"
correlation_matrix = correlation_with_target(df_name)
print(f"Correlation of the {df_name} against the Target is :")
correlation_matrix.T.TARGET.sort_values(ascending= False)
Correlation of the POS_CASH_balance against the Target is :
CNT_INSTALMENT_FUTURE 0.002811 MONTHS_BALANCE 0.002775 SK_ID_PREV 0.002164 CNT_INSTALMENT 0.001434 SK_DPD 0.000050 SK_ID_CURR -0.000136 SK_DPD_DEF -0.001362 Name: TARGET, dtype: float64
# Pipelines
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import LabelEncoder
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import make_pipeline, Pipeline, FeatureUnion
from sklearn.preprocessing import MinMaxScaler, StandardScaler, OneHotEncoder
class FeaturesAggregator(BaseEstimator, TransformerMixin):
    def __init__(self, file_name=None, features=None, funcs=None, primary_key=None): # no *args or **kwargs
        self.file_name = file_name
        self.features = features
        self.funcs = funcs
        self.primary_key = primary_key
        # Map each feature to a list of (output_column_name, aggregation_function) pairs
        self.agg_op_features = {}
        for f in self.features:
            temp = {f"{file_name}_{f}_{func}":func for func in self.funcs}
            self.agg_op_features[f]=[(k, v) for k, v in temp.items()]
        print(self.agg_op_features)
    def fit(self, X, y=None):
        return self
    def transform(self, X, y=None):
        #from IPython.core.debugger import Pdb as pdb; pdb().set_trace() #breakpoint; dont forget to quit
        # One row per primary key, one column per (feature, function) pair
        result = X.groupby([self.primary_key]).agg(self.agg_op_features)
        # Drop the source-column level so only the output column names remain
        result.columns = result.columns.droplevel()
        result = result.reset_index(level=[self.primary_key])
        return result
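To make the aggregation step concrete, here is a small sketch of what `groupby(...).agg(...)` does with the same "column -> list of (output name, function)" structure that `FeaturesAggregator` builds. The toy DataFrame and the `toy_` prefix are made up for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "SK_ID_CURR": [1, 1, 2],
    "AMT_ANNUITY": [100.0, 300.0, 50.0],
})

# Same shape as agg_op_features: column -> [(output_name, func), ...]
agg_spec = {
    "AMT_ANNUITY": [
        ("toy_AMT_ANNUITY_min", "min"),
        ("toy_AMT_ANNUITY_max", "max"),
        ("toy_AMT_ANNUITY_mean", "mean"),
    ]
}

agg = toy.groupby(["SK_ID_CURR"]).agg(agg_spec)
# Columns are now a two-level MultiIndex: (source column, output name).
agg.columns = agg.columns.droplevel()  # keep only the output names
agg = agg.reset_index()                # restore SK_ID_CURR as a regular column
print(agg)
```

The result has one row per `SK_ID_CURR`, which is what allows the later left merges onto the application table.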
agg_funcs = ['min', 'max', 'mean']
# prevApps = datasets['previous_application']
prevApps_features = ['AMT_ANNUITY', 'AMT_APPLICATION','AMT_DOWN_PAYMENT','CNT_PAYMENT','RATE_INTEREST_PRIVILEGED' # Phase 1
,'AMT_CREDIT','DAYS_FIRST_DRAWING','DAYS_LAST_DUE','DAYS_FIRST_DUE' # Phase 2
]
# bureau = datasets['bureau']
bureau_features = ['AMT_ANNUITY', 'AMT_CREDIT_SUM','DAYS_CREDIT','AMT_CREDIT_SUM_OVERDUE','CREDIT_DAY_OVERDUE' # Phase 1
,'AMT_CREDIT_SUM_DEBT','AMT_CREDIT_SUM_LIMIT','AMT_CREDIT_MAX_OVERDUE' # Phase 2
]
# bureau_funcs = ['min', 'max', 'mean', 'count', 'sum']
# bureau_bal = datasets['bureau_balance']
bureau_bal_features = ['MONTHS_BALANCE'] # Phase 1
# cc_bal = datasets['credit_card_balance']
cc_bal_features = ['MONTHS_BALANCE', 'AMT_BALANCE', 'CNT_INSTALMENT_MATURE_CUM','AMT_DRAWINGS_ATM_CURRENT'
,'AMT_INST_MIN_REGULARITY','AMT_PAYMENT_TOTAL_CURRENT' # Phase 1
,'CNT_DRAWINGS_ATM_CURRENT','AMT_CREDIT_LIMIT_ACTUAL','AMT_RECIVABLE'
,'AMT_TOTAL_RECEIVABLE','AMT_RECEIVABLE_PRINCIPAL' # Phase 2
]
# installments_pmnts = datasets['installments_payments']
installments_pmnts_features = ['AMT_INSTALMENT', 'AMT_PAYMENT' # Phase 1
,'DAYS_ENTRY_PAYMENT','DAYS_INSTALMENT','NUM_INSTALMENT_VERSION' # Phase 2
]
pos_cash_balance_features = ['CNT_INSTALMENT_FUTURE','MONTHS_BALANCE','SK_DPD_DEF'] # Phase 1
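Engineer new features for the Application dataset that capture the relationship between income and credit amount and between annuity and income (Phase 1), as well as goods price relative to income, the number of non-child family members, and the living-to-land area ratio (Phase 2).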
class engineer_features(BaseEstimator, TransformerMixin):
    def __init__(self, features=None):
        # Store the constructor argument so that get_params()/clone() work
        self.features = features
    def fit(self, X, y=None):
        return self
    def transform(self, X, y=None):
        # FROM APPLICATION
        # Added in Phase 1
        # Income as a fraction of the credit amount
        X['ef_INCOME_CREDIT_PERCENT'] = (X.AMT_INCOME_TOTAL / X.AMT_CREDIT).replace(np.inf, 0)
        # Annuity as a fraction of annual income
        X['ef_ANN_INCOME_PERCENT'] = (X.AMT_ANNUITY / X.AMT_INCOME_TOTAL).replace(np.inf, 0)
        # Added in Phase 2
        # Goods price as a fraction of annual income
        X['ef_GOODS_PRICE_PERCENT'] = (X.AMT_GOODS_PRICE / X.AMT_INCOME_TOTAL).replace(np.inf, 0)
        # Count of non-children family members
        X['ef_CNT_NON_CHILDREN'] = X.CNT_FAM_MEMBERS - X.CNT_CHILDREN
        # Living to land area ratio
        X['ef_LIVINGAREA_LANDAREA_AVG_RATIO'] = (X.LIVINGAREA_AVG / X.LANDAREA_AVG).replace(np.inf, 0)
        return X
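Note that `transform` assigns the new columns directly on `X`, so the DataFrame passed in (for example the object stored in `datasets`) is modified as a side effect. The sketch below shows a copy-first variant for two of the ratios; the function name `add_income_ratios` and the toy values are hypothetical, and replacing `-inf` as well as `+inf` is an added assumption rather than the notebook's behaviour.

```python
import numpy as np
import pandas as pd

def add_income_ratios(X):
    """Return a copy of X with two ratio features; the input is left untouched."""
    out = X.copy()
    ratio = out["AMT_INCOME_TOTAL"] / out["AMT_CREDIT"]
    # Replace both +inf and -inf (assumption: the notebook only replaces +inf).
    out["ef_INCOME_CREDIT_PERCENT"] = ratio.replace([np.inf, -np.inf], 0)
    out["ef_ANN_INCOME_PERCENT"] = (
        out["AMT_ANNUITY"] / out["AMT_INCOME_TOTAL"]
    ).replace([np.inf, -np.inf], 0)
    return out

toy_apps = pd.DataFrame({
    "AMT_INCOME_TOTAL": [100000.0, 50000.0],
    "AMT_CREDIT": [200000.0, 0.0],   # second row divides by zero
    "AMT_ANNUITY": [10000.0, 5000.0],
})
toy_out = add_income_ratios(toy_apps)
print(toy_out)
```

Working on a copy costs extra memory on a table of this size, so whether it is worth it depends on whether the raw `datasets` entries are reused later.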
Engineer new features capturing the range of annuity, application, and down payment amounts from the Previous Application dataset.
class prevApp_engineer_features(BaseEstimator, TransformerMixin):
    def __init__(self, features=None):
        # Store the constructor argument so that get_params()/clone() work
        self.features = features
    def fit(self, X, y=None):
        return self
    def transform(self, X, y=None):
        # FROM PREVIOUS APPLICATION
        # Added in Phase 1
        # Add annuity, application, and down payment ranges (max - min per applicant)
        X['ef_prevApps_AMT_ANNUITY_range'] = (X.prevApps_AMT_ANNUITY_max - X.prevApps_AMT_ANNUITY_min).replace(np.inf, 0)
        X['ef_prevApps_AMT_APPLICATION_range'] = (X.prevApps_AMT_APPLICATION_max - X.prevApps_AMT_APPLICATION_min).replace(np.inf, 0)
        X['ef_prevApps_AMT_DOWN_PAYMENT_range'] = (X.prevApps_AMT_DOWN_PAYMENT_max - X.prevApps_AMT_DOWN_PAYMENT_min).replace(np.inf, 0)
        return X
Engineer new features from the Bureau dataset: the debt-to-credit-limit and overdue-to-credit-limit ratios, and the ranges of the credit, debt, and overdue amounts.
class bureau_engineer_features(BaseEstimator, TransformerMixin):
    def __init__(self, features=None):
        # Store the constructor argument so that get_params()/clone() work
        self.features = features
    def fit(self, X, y=None):
        return self
    def transform(self, X, y=None):
        # FROM BUREAU DATASET
        # Added in Phase 2
        # Add Debt to Credit Limit Ratio, Overdue amount to Credit Limit Ratio
        bureau_AMT_CREDIT_SUM_DEBT_mean_nan = np.mean(X.bureau_AMT_CREDIT_SUM_DEBT_mean)
        bureau_AMT_CREDIT_SUM_OVERDUE_mean_nan = np.mean(X.bureau_AMT_CREDIT_SUM_OVERDUE_mean)
        bureau_AMT_CREDIT_SUM_LIMIT_mean_nan = np.mean(X.bureau_AMT_CREDIT_SUM_LIMIT_mean)
        X['ef_bureau_AMT_DEBT_CREDIT_RATIO'] = ( (X.bureau_AMT_CREDIT_SUM_DEBT_mean).fillna(bureau_AMT_CREDIT_SUM_DEBT_mean_nan) / (X.bureau_AMT_CREDIT_SUM_LIMIT_mean).fillna(bureau_AMT_CREDIT_SUM_LIMIT_mean_nan) ).replace(np.inf, 0) # fill with mean or median instead of 0
        X['ef_bureau_AMT_OVERDUE_CREDIT_RATIO'] = ( (X.bureau_AMT_CREDIT_SUM_OVERDUE_mean).fillna(bureau_AMT_CREDIT_SUM_OVERDUE_mean_nan) / (X.bureau_AMT_CREDIT_SUM_LIMIT_mean).fillna(bureau_AMT_CREDIT_SUM_LIMIT_mean_nan) ).replace(np.inf, 0) # fill with mean or median instead of 0
        # Add Credit, Debt & Overdue ranges
        X['ef_bureau_AMT_CREDIT_SUM_range'] = (X.bureau_AMT_CREDIT_SUM_max - X.bureau_AMT_CREDIT_SUM_min).replace(np.inf, 0)
        X['ef_bureau_AMT_CREDIT_SUM_DEBT_range'] = (X.bureau_AMT_CREDIT_SUM_DEBT_max - X.bureau_AMT_CREDIT_SUM_DEBT_min).replace(np.inf, 0)
        X['ef_bureau_AMT_CREDIT_SUM_OVERDUE_range'] = (X.bureau_AMT_CREDIT_SUM_OVERDUE_max - X.bureau_AMT_CREDIT_SUM_OVERDUE_min).replace(np.inf, 0)
        return X
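The ratio features divide by a credit limit that can be zero. A short sketch of the two edge cases, using made-up numbers: a positive value divided by zero gives `inf`, which `.replace(np.inf, 0)` turns into 0, while 0/0 gives `NaN`, which that replacement leaves untouched. The final `fillna(0)` is only one possible follow-up and is an assumption, not something the notebook does.

```python
import numpy as np
import pandas as pd

debt = pd.Series([1000.0, 0.0, 5.0])
limit = pd.Series([0.0, 0.0, 10.0])

raw = debt / limit                # row 0: inf, row 1: NaN, row 2: 0.5
cleaned = raw.replace(np.inf, 0)  # inf becomes 0, NaN from 0/0 survives

# One possible follow-up (an assumption): treat an undefined ratio as 0 too.
filled = cleaned.fillna(0)
print(pd.DataFrame({"raw": raw, "cleaned": cleaned, "filled": filled}))
```

This is why a ratio column can still contain `NaN` after the transformer has run, and why any later imputation step needs to account for it.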
from sklearn.pipeline import make_pipeline, Pipeline
primary_key="SK_ID_CURR"
primary_key1= "SK_ID_BUREAU"
# Feature engineering pipeline for application_train:
appln_new_features_pipeline = Pipeline([
('engineer_features', engineer_features()), # add some new features
])
# Feature Engineering for all secondary Datasets:
prevApps_features_pipeline = Pipeline([
# ('prevApps_add_features1', prevApps_add_features1()), # add some new features
# ('prevApps_add_features2', prevApps_add_features2()), # add some new features
('prevApps_aggregator', FeaturesAggregator('prevApps', prevApps_features, agg_funcs,primary_key)), # Aggregate across old and new features
('prevApp_engineer_features', prevApp_engineer_features())
])
bureau_features_pipeline = Pipeline([
('bureau_aggregator', FeaturesAggregator('bureau', bureau_features, agg_funcs,primary_key)), # Aggregate across old and new features
('bureau_engineer_features', bureau_engineer_features())
])
bureau_bal_features_pipeline = Pipeline([
('bureau_bal_aggregator', FeaturesAggregator('bureau_balance', bureau_bal_features , agg_funcs,primary_key1)), # Aggregate across old and new features
])
cc_bal_features_pipeline = Pipeline([
('cc_bal_aggregator', FeaturesAggregator('credit_card_balance', cc_bal_features , agg_funcs,primary_key)), # Aggregate across old and new features
])
installments_pmnts_features_pipeline = Pipeline([
('installments_pmnts_features_aggregator', FeaturesAggregator('installments_pmnts', installments_pmnts_features , agg_funcs,primary_key)), # Aggregate across old and new features
])
pos_cash_balance_features_pipeline = Pipeline([
('pos_cash_balance_features_aggregator', FeaturesAggregator('pos_cash_balance', pos_cash_balance_features , agg_funcs,primary_key)), # Aggregate across old and new features
])
{'AMT_ANNUITY': [('prevApps_AMT_ANNUITY_min', 'min'), ('prevApps_AMT_ANNUITY_max', 'max'), ('prevApps_AMT_ANNUITY_mean', 'mean')], 'AMT_APPLICATION': [('prevApps_AMT_APPLICATION_min', 'min'), ('prevApps_AMT_APPLICATION_max', 'max'), ('prevApps_AMT_APPLICATION_mean', 'mean')], 'AMT_DOWN_PAYMENT': [('prevApps_AMT_DOWN_PAYMENT_min', 'min'), ('prevApps_AMT_DOWN_PAYMENT_max', 'max'), ('prevApps_AMT_DOWN_PAYMENT_mean', 'mean')], 'CNT_PAYMENT': [('prevApps_CNT_PAYMENT_min', 'min'), ('prevApps_CNT_PAYMENT_max', 'max'), ('prevApps_CNT_PAYMENT_mean', 'mean')], 'RATE_INTEREST_PRIVILEGED': [('prevApps_RATE_INTEREST_PRIVILEGED_min', 'min'), ('prevApps_RATE_INTEREST_PRIVILEGED_max', 'max'), ('prevApps_RATE_INTEREST_PRIVILEGED_mean', 'mean')], 'AMT_CREDIT': [('prevApps_AMT_CREDIT_min', 'min'), ('prevApps_AMT_CREDIT_max', 'max'), ('prevApps_AMT_CREDIT_mean', 'mean')], 'DAYS_FIRST_DRAWING': [('prevApps_DAYS_FIRST_DRAWING_min', 'min'), ('prevApps_DAYS_FIRST_DRAWING_max', 'max'), ('prevApps_DAYS_FIRST_DRAWING_mean', 'mean')], 'DAYS_LAST_DUE': [('prevApps_DAYS_LAST_DUE_min', 'min'), ('prevApps_DAYS_LAST_DUE_max', 'max'), ('prevApps_DAYS_LAST_DUE_mean', 'mean')], 'DAYS_FIRST_DUE': [('prevApps_DAYS_FIRST_DUE_min', 'min'), ('prevApps_DAYS_FIRST_DUE_max', 'max'), ('prevApps_DAYS_FIRST_DUE_mean', 'mean')]}
{'AMT_ANNUITY': [('bureau_AMT_ANNUITY_min', 'min'), ('bureau_AMT_ANNUITY_max', 'max'), ('bureau_AMT_ANNUITY_mean', 'mean')], 'AMT_CREDIT_SUM': [('bureau_AMT_CREDIT_SUM_min', 'min'), ('bureau_AMT_CREDIT_SUM_max', 'max'), ('bureau_AMT_CREDIT_SUM_mean', 'mean')], 'DAYS_CREDIT': [('bureau_DAYS_CREDIT_min', 'min'), ('bureau_DAYS_CREDIT_max', 'max'), ('bureau_DAYS_CREDIT_mean', 'mean')], 'AMT_CREDIT_SUM_OVERDUE': [('bureau_AMT_CREDIT_SUM_OVERDUE_min', 'min'), ('bureau_AMT_CREDIT_SUM_OVERDUE_max', 'max'), ('bureau_AMT_CREDIT_SUM_OVERDUE_mean', 'mean')], 'CREDIT_DAY_OVERDUE': [('bureau_CREDIT_DAY_OVERDUE_min', 'min'), ('bureau_CREDIT_DAY_OVERDUE_max', 'max'), ('bureau_CREDIT_DAY_OVERDUE_mean', 'mean')], 'AMT_CREDIT_SUM_DEBT': [('bureau_AMT_CREDIT_SUM_DEBT_min', 'min'), ('bureau_AMT_CREDIT_SUM_DEBT_max', 'max'), ('bureau_AMT_CREDIT_SUM_DEBT_mean', 'mean')], 'AMT_CREDIT_SUM_LIMIT': [('bureau_AMT_CREDIT_SUM_LIMIT_min', 'min'), ('bureau_AMT_CREDIT_SUM_LIMIT_max', 'max'), ('bureau_AMT_CREDIT_SUM_LIMIT_mean', 'mean')], 'AMT_CREDIT_MAX_OVERDUE': [('bureau_AMT_CREDIT_MAX_OVERDUE_min', 'min'), ('bureau_AMT_CREDIT_MAX_OVERDUE_max', 'max'), ('bureau_AMT_CREDIT_MAX_OVERDUE_mean', 'mean')]}
{'MONTHS_BALANCE': [('bureau_balance_MONTHS_BALANCE_min', 'min'), ('bureau_balance_MONTHS_BALANCE_max', 'max'), ('bureau_balance_MONTHS_BALANCE_mean', 'mean')]}
{'MONTHS_BALANCE': [('credit_card_balance_MONTHS_BALANCE_min', 'min'), ('credit_card_balance_MONTHS_BALANCE_max', 'max'), ('credit_card_balance_MONTHS_BALANCE_mean', 'mean')], 'AMT_BALANCE': [('credit_card_balance_AMT_BALANCE_min', 'min'), ('credit_card_balance_AMT_BALANCE_max', 'max'), ('credit_card_balance_AMT_BALANCE_mean', 'mean')], 'CNT_INSTALMENT_MATURE_CUM': [('credit_card_balance_CNT_INSTALMENT_MATURE_CUM_min', 'min'), ('credit_card_balance_CNT_INSTALMENT_MATURE_CUM_max', 'max'), ('credit_card_balance_CNT_INSTALMENT_MATURE_CUM_mean', 'mean')], 'AMT_DRAWINGS_ATM_CURRENT': [('credit_card_balance_AMT_DRAWINGS_ATM_CURRENT_min', 'min'), ('credit_card_balance_AMT_DRAWINGS_ATM_CURRENT_max', 'max'), ('credit_card_balance_AMT_DRAWINGS_ATM_CURRENT_mean', 'mean')], 'AMT_INST_MIN_REGULARITY': [('credit_card_balance_AMT_INST_MIN_REGULARITY_min', 'min'), ('credit_card_balance_AMT_INST_MIN_REGULARITY_max', 'max'), ('credit_card_balance_AMT_INST_MIN_REGULARITY_mean', 'mean')], 'AMT_PAYMENT_TOTAL_CURRENT': [('credit_card_balance_AMT_PAYMENT_TOTAL_CURRENT_min', 'min'), ('credit_card_balance_AMT_PAYMENT_TOTAL_CURRENT_max', 'max'), ('credit_card_balance_AMT_PAYMENT_TOTAL_CURRENT_mean', 'mean')], 'CNT_DRAWINGS_ATM_CURRENT': [('credit_card_balance_CNT_DRAWINGS_ATM_CURRENT_min', 'min'), ('credit_card_balance_CNT_DRAWINGS_ATM_CURRENT_max', 'max'), ('credit_card_balance_CNT_DRAWINGS_ATM_CURRENT_mean', 'mean')], 'AMT_CREDIT_LIMIT_ACTUAL': [('credit_card_balance_AMT_CREDIT_LIMIT_ACTUAL_min', 'min'), ('credit_card_balance_AMT_CREDIT_LIMIT_ACTUAL_max', 'max'), ('credit_card_balance_AMT_CREDIT_LIMIT_ACTUAL_mean', 'mean')], 'AMT_RECIVABLE': [('credit_card_balance_AMT_RECIVABLE_min', 'min'), ('credit_card_balance_AMT_RECIVABLE_max', 'max'), ('credit_card_balance_AMT_RECIVABLE_mean', 'mean')], 'AMT_TOTAL_RECEIVABLE': [('credit_card_balance_AMT_TOTAL_RECEIVABLE_min', 'min'), ('credit_card_balance_AMT_TOTAL_RECEIVABLE_max', 'max'), ('credit_card_balance_AMT_TOTAL_RECEIVABLE_mean', 'mean')], 
'AMT_RECEIVABLE_PRINCIPAL': [('credit_card_balance_AMT_RECEIVABLE_PRINCIPAL_min', 'min'), ('credit_card_balance_AMT_RECEIVABLE_PRINCIPAL_max', 'max'), ('credit_card_balance_AMT_RECEIVABLE_PRINCIPAL_mean', 'mean')]}
{'AMT_INSTALMENT': [('installments_pmnts_AMT_INSTALMENT_min', 'min'), ('installments_pmnts_AMT_INSTALMENT_max', 'max'), ('installments_pmnts_AMT_INSTALMENT_mean', 'mean')], 'AMT_PAYMENT': [('installments_pmnts_AMT_PAYMENT_min', 'min'), ('installments_pmnts_AMT_PAYMENT_max', 'max'), ('installments_pmnts_AMT_PAYMENT_mean', 'mean')], 'DAYS_ENTRY_PAYMENT': [('installments_pmnts_DAYS_ENTRY_PAYMENT_min', 'min'), ('installments_pmnts_DAYS_ENTRY_PAYMENT_max', 'max'), ('installments_pmnts_DAYS_ENTRY_PAYMENT_mean', 'mean')], 'DAYS_INSTALMENT': [('installments_pmnts_DAYS_INSTALMENT_min', 'min'), ('installments_pmnts_DAYS_INSTALMENT_max', 'max'), ('installments_pmnts_DAYS_INSTALMENT_mean', 'mean')], 'NUM_INSTALMENT_VERSION': [('installments_pmnts_NUM_INSTALMENT_VERSION_min', 'min'), ('installments_pmnts_NUM_INSTALMENT_VERSION_max', 'max'), ('installments_pmnts_NUM_INSTALMENT_VERSION_mean', 'mean')]}
{'CNT_INSTALMENT_FUTURE': [('pos_cash_balance_CNT_INSTALMENT_FUTURE_min', 'min'), ('pos_cash_balance_CNT_INSTALMENT_FUTURE_max', 'max'), ('pos_cash_balance_CNT_INSTALMENT_FUTURE_mean', 'mean')], 'MONTHS_BALANCE': [('pos_cash_balance_MONTHS_BALANCE_min', 'min'), ('pos_cash_balance_MONTHS_BALANCE_max', 'max'), ('pos_cash_balance_MONTHS_BALANCE_mean', 'mean')], 'SK_DPD_DEF': [('pos_cash_balance_SK_DPD_DEF_min', 'min'), ('pos_cash_balance_SK_DPD_DEF_max', 'max'), ('pos_cash_balance_SK_DPD_DEF_mean', 'mean')]}
# Primary Application Training Dataset
appsTrainDF_agg = datasets['application_train']
# Secondary Datasets
prevApps_agg = datasets["previous_application"] # previous applications
bureau_agg = datasets["bureau"] # bureau
bureaubal_agg = datasets['bureau_balance'] # bureau balance
ccblance_agg = datasets["credit_card_balance"] # credit card balance
installmentspayments_agg = datasets["installments_payments"] # installment payments
posbal_agg = datasets['POS_CASH_balance'] # POS cash balance
Create Aggregate datasets after performing fit & transform
appsTrainDF_agg = appln_new_features_pipeline.fit_transform(appsTrainDF_agg)
prevApps_agg = prevApps_features_pipeline.fit_transform(prevApps_agg)
# prevApps_agg = prevApp_new_features_pipeline.fit_transform(prevApps_agg)
bureaubal_agg = bureau_bal_features_pipeline.fit_transform(bureaubal_agg)
ccblance_agg = cc_bal_features_pipeline.fit_transform(ccblance_agg)
installmentspayments_agg = installments_pmnts_features_pipeline.fit_transform(installmentspayments_agg)
posbal_agg = pos_cash_balance_features_pipeline.fit_transform(posbal_agg)
bureau_agg = bureau_agg.merge(bureaubal_agg, how='left', on='SK_ID_BUREAU')
bureau_agg = bureau_features_pipeline.fit_transform(bureau_agg)
bureau_agg.head()
| | SK_ID_CURR | bureau_AMT_ANNUITY_min | bureau_AMT_ANNUITY_max | bureau_AMT_ANNUITY_mean | bureau_AMT_CREDIT_SUM_min | bureau_AMT_CREDIT_SUM_max | bureau_AMT_CREDIT_SUM_mean | bureau_DAYS_CREDIT_min | bureau_DAYS_CREDIT_max | bureau_DAYS_CREDIT_mean | ... | bureau_AMT_CREDIT_SUM_LIMIT_max | bureau_AMT_CREDIT_SUM_LIMIT_mean | bureau_AMT_CREDIT_MAX_OVERDUE_min | bureau_AMT_CREDIT_MAX_OVERDUE_max | bureau_AMT_CREDIT_MAX_OVERDUE_mean | ef_bureau_AMT_DEBT_CREDIT_RATIO | ef_bureau_AMT_OVERDUE_CREDIT_RATIO | ef_bureau_AMT_CREDIT_SUM_range | ef_bureau_AMT_CREDIT_SUM_DEBT_range | ef_bureau_AMT_CREDIT_SUM_OVERDUE_range |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100001 | 0.0 | 10822.5 | 3545.357143 | 85500.0 | 378000.0 | 207623.571429 | -1572 | -49 | -735.000000 | ... | 0.000 | 0.00000 | NaN | NaN | NaN | 0.000000 | NaN | 292500.0 | 373239.0 | 0.0 |
| 1 | 100002 | 0.0 | 0.0 | 0.000000 | 0.0 | 450000.0 | 108131.945625 | -1437 | -103 | -874.000000 | ... | 31988.565 | 7997.14125 | 0.0 | 5043.645 | 1681.029 | 6.146721 | 0.0 | 450000.0 | 245781.0 | 0.0 |
| 2 | 100003 | NaN | NaN | NaN | 22248.0 | 810000.0 | 254350.125000 | -2586 | -606 | -1400.750000 | ... | 810000.000 | 202500.00000 | 0.0 | 0.000 | 0.000 | 0.000000 | 0.0 | 787752.0 | 0.0 | 0.0 |
| 3 | 100004 | NaN | NaN | NaN | 94500.0 | 94537.8 | 94518.900000 | -1326 | -408 | -867.000000 | ... | 0.000 | 0.00000 | 0.0 | 0.000 | 0.000 | NaN | NaN | 37.8 | 0.0 | 0.0 |
| 4 | 100005 | 0.0 | 4261.5 | 1420.500000 | 29826.0 | 568800.0 | 219042.000000 | -373 | -62 | -190.666667 | ... | 0.000 | 0.00000 | 0.0 | 0.000 | 0.000 | 0.000000 | NaN | 538974.0 | 543087.0 | 0.0 |
5 rows × 30 columns
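The 30 columns above are `SK_ID_CURR`, the 24 aggregates of `bureau_features`, and the 5 engineered features, so it appears that the `bureau_balance_*` columns merged in on `SK_ID_BUREAU` are not carried through the second aggregation, since they are not listed in `bureau_features`. A minimal sketch of a two-level aggregation that keeps them, on hypothetical toy frames (the choice of `mean` at the second level is also just an assumption):

```python
import pandas as pd

toy_bureau = pd.DataFrame({
    "SK_ID_CURR": [1, 1, 2],
    "SK_ID_BUREAU": [10, 11, 12],
    "AMT_CREDIT_SUM": [1000.0, 3000.0, 500.0],
})
toy_bureau_bal = pd.DataFrame({
    "SK_ID_BUREAU": [10, 10, 11, 12],
    "MONTHS_BALANCE": [-2, -1, -5, -3],
})

# Level 1: one row per SK_ID_BUREAU.
bal_per_credit = (
    toy_bureau_bal.groupby("SK_ID_BUREAU")["MONTHS_BALANCE"]
    .agg(["min", "max", "mean"])
    .add_prefix("bureau_balance_MONTHS_BALANCE_")
    .reset_index()
)

# Attach to bureau, then Level 2: one row per SK_ID_CURR.
merged = toy_bureau.merge(bal_per_credit, how="left", on="SK_ID_BUREAU")
second_level_cols = ["AMT_CREDIT_SUM", "bureau_balance_MONTHS_BALANCE_min"]
per_client = merged.groupby("SK_ID_CURR")[second_level_cols].mean().reset_index()
print(per_client)
```

In the notebook's terms, this would amount to adding the `bureau_balance_MONTHS_BALANCE_*` names to the feature list passed to the bureau `FeaturesAggregator`.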
prevApps_agg.head()
| | SK_ID_CURR | prevApps_AMT_ANNUITY_min | prevApps_AMT_ANNUITY_max | prevApps_AMT_ANNUITY_mean | prevApps_AMT_APPLICATION_min | prevApps_AMT_APPLICATION_max | prevApps_AMT_APPLICATION_mean | prevApps_AMT_DOWN_PAYMENT_min | prevApps_AMT_DOWN_PAYMENT_max | prevApps_AMT_DOWN_PAYMENT_mean | ... | prevApps_DAYS_FIRST_DRAWING_mean | prevApps_DAYS_LAST_DUE_min | prevApps_DAYS_LAST_DUE_max | prevApps_DAYS_LAST_DUE_mean | prevApps_DAYS_FIRST_DUE_min | prevApps_DAYS_FIRST_DUE_max | prevApps_DAYS_FIRST_DUE_mean | ef_prevApps_AMT_ANNUITY_range | ef_prevApps_AMT_APPLICATION_range | ef_prevApps_AMT_DOWN_PAYMENT_range |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100001 | 3951.000 | 3951.000 | 3951.000 | 24835.5 | 24835.5 | 24835.50 | 2520.0 | 2520.0 | 2520.0 | ... | 365243.0 | -1619.0 | -1619.0 | -1619.000000 | -1709.0 | -1709.0 | -1709.000000 | 0.000 | 0.0 | 0.0 |
| 1 | 100002 | 9251.775 | 9251.775 | 9251.775 | 179055.0 | 179055.0 | 179055.00 | 0.0 | 0.0 | 0.0 | ... | 365243.0 | -25.0 | -25.0 | -25.000000 | -565.0 | -565.0 | -565.000000 | 0.000 | 0.0 | 0.0 |
| 2 | 100003 | 6737.310 | 98356.995 | 56553.990 | 68809.5 | 900000.0 | 435436.50 | 0.0 | 6885.0 | 3442.5 | ... | 365243.0 | -1980.0 | -536.0 | -1054.333333 | -2310.0 | -716.0 | -1274.333333 | 91619.685 | 831190.5 | 6885.0 |
| 3 | 100004 | 5357.250 | 5357.250 | 5357.250 | 24282.0 | 24282.0 | 24282.00 | 4860.0 | 4860.0 | 4860.0 | ... | 365243.0 | -724.0 | -724.0 | -724.000000 | -784.0 | -784.0 | -784.000000 | 0.000 | 0.0 | 0.0 |
| 4 | 100005 | 4813.200 | 4813.200 | 4813.200 | 0.0 | 44617.5 | 22308.75 | 4464.0 | 4464.0 | 4464.0 | ... | 365243.0 | -466.0 | -466.0 | -466.000000 | -706.0 | -706.0 | -706.000000 | 0.000 | 44617.5 | 0.0 |
5 rows × 31 columns
~3==3
False
datasets.keys()
dict_keys(['application_train', 'application_test', 'bureau', 'bureau_balance', 'credit_card_balance', 'installments_payments', 'previous_application', 'POS_CASH_balance'])
Perform data merging of primary application and secondary datasets.
merge_all_data = True
# if merge_all_data:
# prevApps_aggregated = prevApps_feature_pipeline.transform(appsDF)
# merge primary table and secondary tables using features based on metadata and aggregate stats
if merge_all_data:
    appsTrainDF_agg = appsTrainDF_agg.merge(prevApps_agg, how='left', on='SK_ID_CURR')
    appsTrainDF_agg = appsTrainDF_agg.merge(bureau_agg, how='left', on="SK_ID_CURR")
    appsTrainDF_agg = appsTrainDF_agg.merge(ccblance_agg, how='left', on="SK_ID_CURR")
    appsTrainDF_agg = appsTrainDF_agg.merge(installmentspayments_agg, how='left', on="SK_ID_CURR")
    appsTrainDF_agg = appsTrainDF_agg.merge(posbal_agg, how='left', on="SK_ID_CURR")
    #appsTrainDF_agg = appsTrainDF_agg.merge(bureaubal_agg, how='left', on="SK_ID_BUREAU")
appsTrainDF_agg.shape
(307511, 243)
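Each left merge should keep exactly one row per application, which holds only if the right-hand table has unique `SK_ID_CURR` values. A small sketch of how this could be checked with the `validate` and `indicator` parameters of `DataFrame.merge`; the toy frames and the column name `agg_feature_mean` are hypothetical.

```python
import pandas as pd

toy_left = pd.DataFrame({"SK_ID_CURR": [1, 2, 3], "TARGET": [0, 1, 0]})
toy_right = pd.DataFrame({"SK_ID_CURR": [1, 2], "agg_feature_mean": [0.5, 1.5]})

rows_before = len(toy_left)
toy_merged = toy_left.merge(
    toy_right, how="left", on="SK_ID_CURR",
    validate="one_to_one",  # raises pandas.errors.MergeError on duplicate keys
    indicator=True,         # adds a _merge column: 'both' or 'left_only'
)
assert len(toy_merged) == rows_before  # a left merge on unique keys adds no rows

# Share of applications that found a match in the secondary table.
match_rate = (toy_merged["_merge"] == "both").mean()
print(f"rows: {len(toy_merged)}, match rate: {match_rate:.2f}")
```

Applications without a match get `NaN` in all aggregated columns, so the match rate also indicates how much imputation the merged features will need.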
Check presence of newly engineered features
appsTrainDF_agg[['ef_INCOME_CREDIT_PERCENT', 'ef_ANN_INCOME_PERCENT','ef_prevApps_AMT_ANNUITY_range'
, 'ef_prevApps_AMT_APPLICATION_range', 'ef_prevApps_AMT_DOWN_PAYMENT_range'
, 'ef_prevApps_AMT_ANNUITY_range', 'ef_prevApps_AMT_APPLICATION_range'
, 'ef_prevApps_AMT_DOWN_PAYMENT_range']].head()
| | ef_INCOME_CREDIT_PERCENT | ef_ANN_INCOME_PERCENT | ef_prevApps_AMT_ANNUITY_range | ef_prevApps_AMT_APPLICATION_range | ef_prevApps_AMT_DOWN_PAYMENT_range | ef_prevApps_AMT_ANNUITY_range | ef_prevApps_AMT_APPLICATION_range | ef_prevApps_AMT_DOWN_PAYMENT_range |
|---|---|---|---|---|---|---|---|---|
| 0 | 0.498036 | 0.121978 | 0.000 | 0.0 | 0.00 | 0.000 | 0.0 | 0.00 |
| 1 | 0.208736 | 0.132217 | 91619.685 | 831190.5 | 6885.00 | 91619.685 | 831190.5 | 6885.00 |
| 2 | 0.500000 | 0.100000 | 0.000 | 0.0 | 0.00 | 0.000 | 0.0 | 0.00 |
| 3 | 0.431748 | 0.219900 | 37471.590 | 688500.0 | 64293.66 | 37471.590 | 688500.0 | 64293.66 |
| 4 | 0.236842 | 0.179963 | 20844.495 | 230323.5 | 571.50 | 20844.495 | 230323.5 | 571.50 |
Apply the same feature engineering and merging steps to the Kaggle test set (application_test).
X_kaggle_test = datasets["application_test"]
# The pipeline was already fitted on the training data and its steps are stateless, so transform() is sufficient here
X_kaggle_test = appln_new_features_pipeline.transform(X_kaggle_test)
merge_all_data = True
if merge_all_data:
    X_kaggle_test = X_kaggle_test.merge(prevApps_agg, how='left', on='SK_ID_CURR')
    X_kaggle_test = X_kaggle_test.merge(bureau_agg, how='left', on="SK_ID_CURR")
    X_kaggle_test = X_kaggle_test.merge(ccblance_agg, how='left', on="SK_ID_CURR")
    X_kaggle_test = X_kaggle_test.merge(installmentspayments_agg, how='left', on="SK_ID_CURR")
    X_kaggle_test = X_kaggle_test.merge(posbal_agg, how='left', on="SK_ID_CURR")
print(X_kaggle_test.shape)
X_kaggle_test.head(3)
(48744, 242)
| | SK_ID_CURR | NAME_CONTRACT_TYPE | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | AMT_GOODS_PRICE | ... | installments_pmnts_NUM_INSTALMENT_VERSION_mean | pos_cash_balance_CNT_INSTALMENT_FUTURE_min | pos_cash_balance_CNT_INSTALMENT_FUTURE_max | pos_cash_balance_CNT_INSTALMENT_FUTURE_mean | pos_cash_balance_MONTHS_BALANCE_min | pos_cash_balance_MONTHS_BALANCE_max | pos_cash_balance_MONTHS_BALANCE_mean | pos_cash_balance_SK_DPD_DEF_min | pos_cash_balance_SK_DPD_DEF_max | pos_cash_balance_SK_DPD_DEF_mean |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100001 | Cash loans | F | N | Y | 0 | 135000.0 | 568800.0 | 20560.5 | 450000.0 | ... | 1.142857 | 0.0 | 4.0 | 1.444444 | -96.0 | -53.0 | -72.555556 | 0.0 | 7.0 | 0.777778 |
| 1 | 100005 | Cash loans | M | N | Y | 0 | 99000.0 | 222768.0 | 17370.0 | 180000.0 | ... | 1.111111 | 0.0 | 12.0 | 7.200000 | -25.0 | -15.0 | -20.000000 | 0.0 | 0.0 | 0.000000 |
| 2 | 100013 | Cash loans | M | Y | Y | 0 | 202500.0 | 663264.0 | 69777.0 | 630000.0 | ... | 0.277419 | 0.0 | 36.0 | 15.305556 | -66.0 | -3.0 | -29.555556 | 0.0 | 0.0 | 0.000000 |
3 rows × 242 columns
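The merged training set has 243 columns and the merged test set 242; the one-column difference is expected to be `TARGET`. A quick sketch of how that could be verified; the helper name `column_diff` and the toy frames are hypothetical.

```python
import pandas as pd

def column_diff(train_df, test_df):
    """Return (only_in_train, only_in_test) as sorted lists of column names."""
    train_cols, test_cols = set(train_df.columns), set(test_df.columns)
    return sorted(train_cols - test_cols), sorted(test_cols - train_cols)

# Toy frames standing in for the merged train and test tables.
toy_train = pd.DataFrame(columns=["SK_ID_CURR", "TARGET", "AMT_CREDIT"])
toy_test = pd.DataFrame(columns=["SK_ID_CURR", "AMT_CREDIT"])
only_train, only_test = column_diff(toy_train, toy_test)
print(only_train, only_test)
```

On the real tables the call would be `column_diff(appsTrainDF_agg, X_kaggle_test)`; anything other than `TARGET` in the first list, or any entry in the second, would point to a mismatch between the two merge sequences.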
# Convert categorical features to numerical approximations (via pipeline)
class ClaimAttributesAdder(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self
    def transform(self, X, y=None):
        charlson_idx_dt = {'0': 0, '1-2': 2, '3-4': 4, '5+': 6}
        los_dt = {'1 day': 1, '2 days': 2, '3 days': 3, '4 days': 4, '5 days': 5, '6 days': 6,
                  '1- 2 weeks': 11, '2- 4 weeks': 21, '4- 8 weeks': 42, '26+ weeks': 180}
        X['PayDelay'] = X['PayDelay'].apply(lambda x: int(x) if x != '162+' else int(162))
        X['DSFS'] = X['DSFS'].apply(lambda x: None if pd.isnull(x) else int(x[0]) + 1)
        X['CharlsonIndex'] = X['CharlsonIndex'].apply(lambda x: charlson_idx_dt[x])
        X['LengthOfStay'] = X['LengthOfStay'].apply(lambda x: None if pd.isnull(x) else los_dt[x])
        return X
Train, validation and Test sets (and the leakage problem we have mentioned previously):
Let's look at a small use case that shows how to deal with this:
Fitting an encoder on the training set and then transforming a test set that contains new categories can raise a ValueError. This is because there are new, previously unseen unique values in the test set, and the encoder doesn’t know how to handle them. To use both the transformed training and test sets in machine learning algorithms, they need to have the same number of columns. This problem can be solved with the handle_unknown='ignore' option of OneHotEncoder, which, as the name suggests, ignores previously unseen values when transforming the test set.
Here is an example of that in action:
# Identify the categorical features we wish to consider.
cat_attribs = ['CODE_GENDER', 'FLAG_OWN_REALTY','FLAG_OWN_CAR','NAME_CONTRACT_TYPE',
'NAME_EDUCATION_TYPE','OCCUPATION_TYPE','NAME_INCOME_TYPE']
# Notice handle_unknown="ignore" in OHE which ignore values from the validation/test that
# do NOT occur in the training set
cat_pipeline = Pipeline([
('selector', DataFrameSelector(cat_attribs)),
('imputer', SimpleImputer(strategy='most_frequent')),
('ohe', OneHotEncoder(sparse=False, handle_unknown="ignore"))
])
Please see this blog for more details on OHE when the validation/test sets contain previously unseen unique values.
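To see the effect of `handle_unknown="ignore"` in isolation, here is a toy sketch with made-up category values. It relies on the encoder's default sparse output and calls `.toarray()` for display, which avoids the `sparse` argument whose name differs across scikit-learn versions.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

train_cats = np.array([["Laborers"], ["Drivers"], ["Laborers"]], dtype=object)
test_cats = np.array([["Drivers"], ["Managers"]], dtype=object)  # 'Managers' is unseen

encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train_cats)  # categories are learned from the training data only

# Output is sparse by default; densify for display.
encoded_test = encoder.transform(test_cats).toarray()
print(encoder.categories_[0])
print(encoded_test)
```

The unseen value is encoded as an all-zero row, so the test matrix keeps the same number of columns as the training matrix.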
#train_dataset = datasets["application_train"]
train_dataset = appsTrainDF_agg
class_labels = ["No Default","Default"]
# Create a class to select numerical or categorical columns
# since Scikit-Learn doesn't handle DataFrames yet
class DataFrameSelector(BaseEstimator, TransformerMixin):
    def __init__(self, attribute_names):
        self.attribute_names = attribute_names
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return X[self.attribute_names].values
# Identify the numeric features we wish to consider.
num_attribs = [
'AMT_INCOME_TOTAL',
'AMT_CREDIT',
'EXT_SOURCE_3',
'EXT_SOURCE_2',
'EXT_SOURCE_1',
'DAYS_EMPLOYED',
'DAYS_BIRTH',
'FLOORSMAX_AVG',
'FLOORSMAX_MEDI',
'FLOORSMAX_MODE',
'AMT_GOODS_PRICE',
'REGION_POPULATION_RELATIVE',
'REG_CITY_NOT_LIVE_CITY',
'FLAG_EMP_PHONE',
'REG_CITY_NOT_WORK_CITY',
'DAYS_ID_PUBLISH',
'DAYS_LAST_PHONE_CHANGE',
'REGION_RATING_CLIENT',
'REGION_RATING_CLIENT_W_CITY',
'OWN_CAR_AGE',
'AMT_REQ_CREDIT_BUREAU_QRT',
'AMT_REQ_CREDIT_BUREAU_YEAR',
'ef_INCOME_CREDIT_PERCENT',
'ef_ANN_INCOME_PERCENT',
'ef_GOODS_PRICE_PERCENT',
'ef_CNT_NON_CHILDREN',
'ef_LIVINGAREA_LANDAREA_AVG_RATIO',
## Highly correlated previous applications
'prevApps_AMT_ANNUITY_mean',
'prevApps_AMT_DOWN_PAYMENT_min',
'prevApps_AMT_DOWN_PAYMENT_mean',
'prevApps_CNT_PAYMENT_max',
'prevApps_RATE_INTEREST_PRIVILEGED_mean',
'prevApps_AMT_CREDIT_mean',
'prevApps_AMT_CREDIT_min',
'prevApps_AMT_CREDIT_max',
'prevApps_DAYS_FIRST_DRAWING_mean',
'prevApps_DAYS_LAST_DUE_mean',
'prevApps_DAYS_FIRST_DUE_mean',
'ef_prevApps_AMT_ANNUITY_range',
'ef_prevApps_AMT_APPLICATION_range',
'ef_prevApps_AMT_DOWN_PAYMENT_range',
## Highly correlated Bureau features
'bureau_AMT_ANNUITY_mean',
'bureau_AMT_CREDIT_SUM_mean',
'bureau_DAYS_CREDIT_mean',
'bureau_DAYS_CREDIT_max',
'bureau_AMT_CREDIT_SUM_DEBT_mean',
'bureau_AMT_CREDIT_SUM_LIMIT_mean',
'bureau_AMT_CREDIT_MAX_OVERDUE_mean',
'ef_bureau_AMT_CREDIT_SUM_range',
'ef_bureau_AMT_CREDIT_SUM_DEBT_range',
'ef_bureau_AMT_CREDIT_SUM_OVERDUE_range',
## Highly correlated Installment Payment features
'installments_pmnts_AMT_INSTALMENT_min',
'installments_pmnts_AMT_INSTALMENT_max',
'installments_pmnts_AMT_INSTALMENT_mean',
'installments_pmnts_AMT_PAYMENT_mean',
'installments_pmnts_DAYS_ENTRY_PAYMENT_min',
'installments_pmnts_DAYS_ENTRY_PAYMENT_max',
'installments_pmnts_DAYS_ENTRY_PAYMENT_mean',
'installments_pmnts_DAYS_INSTALMENT_min',
'installments_pmnts_DAYS_INSTALMENT_max',
'installments_pmnts_DAYS_INSTALMENT_mean',
'installments_pmnts_NUM_INSTALMENT_VERSION_mean',
## Highly correlated Credit card balance features
'credit_card_balance_MONTHS_BALANCE_min',
'credit_card_balance_MONTHS_BALANCE_max',
'credit_card_balance_MONTHS_BALANCE_mean',
'credit_card_balance_AMT_BALANCE_min',
'credit_card_balance_AMT_BALANCE_max',
'credit_card_balance_AMT_BALANCE_mean',
'credit_card_balance_AMT_DRAWINGS_ATM_CURRENT_mean',
'credit_card_balance_AMT_INST_MIN_REGULARITY_mean',
'credit_card_balance_AMT_PAYMENT_TOTAL_CURRENT_mean',
'credit_card_balance_CNT_DRAWINGS_ATM_CURRENT_min',
'credit_card_balance_CNT_DRAWINGS_ATM_CURRENT_max',
'credit_card_balance_CNT_DRAWINGS_ATM_CURRENT_mean',
'credit_card_balance_AMT_CREDIT_LIMIT_ACTUAL_max',
'credit_card_balance_AMT_RECIVABLE_mean',
'credit_card_balance_AMT_TOTAL_RECEIVABLE_mean',
'credit_card_balance_AMT_RECEIVABLE_PRINCIPAL_min',
'credit_card_balance_AMT_RECEIVABLE_PRINCIPAL_max',
'credit_card_balance_AMT_RECEIVABLE_PRINCIPAL_mean',
## Highly correlated POS balance features
'pos_cash_balance_CNT_INSTALMENT_FUTURE_min',
'pos_cash_balance_CNT_INSTALMENT_FUTURE_max',
'pos_cash_balance_CNT_INSTALMENT_FUTURE_mean',
'pos_cash_balance_MONTHS_BALANCE_mean'
]
print('Number of numerical features: ', len(num_attribs))
Number of numerical features: 84
num_pipeline = Pipeline([
('selector', DataFrameSelector(num_attribs)),
('imputer', SimpleImputer(strategy='median')),
('std_scaler', StandardScaler()),
])
# Identify the categorical features we wish to consider.
# cat_attribs = ['CODE_GENDER', 'FLAG_OWN_REALTY','FLAG_OWN_CAR','NAME_CONTRACT_TYPE',
# 'NAME_EDUCATION_TYPE','OCCUPATION_TYPE','NAME_INCOME_TYPE']
cat_attribs = ['CODE_GENDER', 'FLAG_OWN_REALTY','FLAG_OWN_CAR','NAME_CONTRACT_TYPE',
'NAME_EDUCATION_TYPE','OCCUPATION_TYPE','NAME_INCOME_TYPE']
print('Number of categorical features: ', len(cat_attribs))
Number of categorical features: 7
cat_pipeline = Pipeline([
('selector', DataFrameSelector(cat_attribs)),
('imputer', SimpleImputer(strategy='most_frequent')),
('ohe', OneHotEncoder(sparse=False, handle_unknown="ignore"))
])
Use ColumnTransformer instead of FeatureUnion
from sklearn.compose import ColumnTransformer
# data_prep_pipeline = FeatureUnion(transformer_list=[
# ("num_pipeline", num_pipeline),
# ("cat_pipeline", cat_pipeline),
# ])
data_prep_pipeline = ColumnTransformer(transformers=[
#( name, transformer, columns)
("num_pipeline", num_pipeline, num_attribs),
("cat_pipeline", cat_pipeline, cat_attribs),
],
n_jobs=-1
)
# print('Numerical Feature Family: ', num_attribs)
print('Numerical Feature Count: ', len(num_attribs))
print('--------------------------')
# print('Categorical Feature Family: ', cat_attribs)
print('Categorical Feature Count: ', len(cat_attribs))
print('--------------------------')
print('Total Number of Input Features: ', len(num_attribs) + len(cat_attribs))
Numerical Feature Count: 84
--------------------------
Categorical Feature Count: 7
--------------------------
Total Number of Input Features: 91
selected_features = num_attribs + cat_attribs
len(selected_features)
91
# from sklearn.base import BaseEstimator, TransformerMixin
# import re
# # Creates the following date features
# # But could do so much more with these features
# # E.g.,
# # extract the domain address of the homepage and OneHotEncode it
# #
# # ['release_month','release_day','release_year', 'release_dayofweek','release_quarter']
# class prep_OCCUPATION_TYPE(BaseEstimator, TransformerMixin):
# def __init__(self, features="OCCUPATION_TYPE"): # no *args or **kargs
# self.features = features
# def fit(self, X, y=None):
# return self # nothing else to do
# def transform(self, X):
# df = pd.DataFrame(X, columns=self.features)
# #from IPython.core.debugger import Pdb as pdb; pdb().set_trace() #breakpoint; dont forget to quit
# # df['OCCUPATION_TYPE'] = df['OCCUPATION_TYPE'].apply(lambda x: 1. if x in ['Core Staff', 'Accountants', 'Managers', 'Sales Staff', 'Medicine Staff', 'High Skill Tech Staff', 'Realty Agents', 'IT Staff', 'HR Staff'] else 0.)
# #df['OCCUPATION_TYPE'] = df['OCCUPATION_TYPE'].apply(map(lambda x: 1. if x.lower() in ['Core Staff', 'Accountants', 'Managers', 'Sales Staff', 'Medicine Staff', 'High Skill Tech Staff', 'Realty Agents', 'IT Staff', 'HR Staff'] else 0.))
# df['OCCUPATION_TYPE'] = df['OCCUPATION_TYPE'].apply(lambda x: 1. if x in list(map(lambda x: x.lower(), ['Core Staff', 'Accountants', 'Managers', 'Sales Staff', 'Medicine Staff', 'High Skill Tech Staff', 'Realty Agents', 'IT Staff', 'HR Staff'])) else 0.)
# #df.drop(self.features, axis=1, inplace=True)
# return np.array(df.values) #return a Numpy Array to observe the pipeline protocol
# from sklearn.pipeline import make_pipeline
# features = ["OCCUPATION_TYPE"]
# def test_driver_prep_OCCUPATION_TYPE():
# print(f"X_train.shape: {X_train.shape}\n")
# print(f"X_train['name'][0:5]: \n{X_train[features][0:5]}")
# test_pipeline = make_pipeline(prep_OCCUPATION_TYPE(features))
# return(test_pipeline.fit_transform(X_train))
# x = test_driver_prep_OCCUPATION_TYPE()
# print(f"Test driver: \n{test_driver_prep_OCCUPATION_TYPE()[0:10, :]}")
# print(f"X_train['name'][0:10]: \n{X_train[features][0:10]}")
# # QUESTION, should we lower case df['OCCUPATION_TYPE'] as Sales staff != 'Sales Staff'? (hint: YES)
list(datasets["application_train"].columns)
['SK_ID_CURR', 'TARGET', 'NAME_CONTRACT_TYPE', 'CODE_GENDER', 'FLAG_OWN_CAR', 'FLAG_OWN_REALTY', 'CNT_CHILDREN', 'AMT_INCOME_TOTAL', 'AMT_CREDIT', 'AMT_ANNUITY', 'AMT_GOODS_PRICE', 'NAME_TYPE_SUITE', 'NAME_INCOME_TYPE', 'NAME_EDUCATION_TYPE', 'NAME_FAMILY_STATUS', 'NAME_HOUSING_TYPE', 'REGION_POPULATION_RELATIVE', 'DAYS_BIRTH', 'DAYS_EMPLOYED', 'DAYS_REGISTRATION', 'DAYS_ID_PUBLISH', 'OWN_CAR_AGE', 'FLAG_MOBIL', 'FLAG_EMP_PHONE', 'FLAG_WORK_PHONE', 'FLAG_CONT_MOBILE', 'FLAG_PHONE', 'FLAG_EMAIL', 'OCCUPATION_TYPE', 'CNT_FAM_MEMBERS', 'REGION_RATING_CLIENT', 'REGION_RATING_CLIENT_W_CITY', 'WEEKDAY_APPR_PROCESS_START', 'HOUR_APPR_PROCESS_START', 'REG_REGION_NOT_LIVE_REGION', 'REG_REGION_NOT_WORK_REGION', 'LIVE_REGION_NOT_WORK_REGION', 'REG_CITY_NOT_LIVE_CITY', 'REG_CITY_NOT_WORK_CITY', 'LIVE_CITY_NOT_WORK_CITY', 'ORGANIZATION_TYPE', 'EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3', 'APARTMENTS_AVG', 'BASEMENTAREA_AVG', 'YEARS_BEGINEXPLUATATION_AVG', 'YEARS_BUILD_AVG', 'COMMONAREA_AVG', 'ELEVATORS_AVG', 'ENTRANCES_AVG', 'FLOORSMAX_AVG', 'FLOORSMIN_AVG', 'LANDAREA_AVG', 'LIVINGAPARTMENTS_AVG', 'LIVINGAREA_AVG', 'NONLIVINGAPARTMENTS_AVG', 'NONLIVINGAREA_AVG', 'APARTMENTS_MODE', 'BASEMENTAREA_MODE', 'YEARS_BEGINEXPLUATATION_MODE', 'YEARS_BUILD_MODE', 'COMMONAREA_MODE', 'ELEVATORS_MODE', 'ENTRANCES_MODE', 'FLOORSMAX_MODE', 'FLOORSMIN_MODE', 'LANDAREA_MODE', 'LIVINGAPARTMENTS_MODE', 'LIVINGAREA_MODE', 'NONLIVINGAPARTMENTS_MODE', 'NONLIVINGAREA_MODE', 'APARTMENTS_MEDI', 'BASEMENTAREA_MEDI', 'YEARS_BEGINEXPLUATATION_MEDI', 'YEARS_BUILD_MEDI', 'COMMONAREA_MEDI', 'ELEVATORS_MEDI', 'ENTRANCES_MEDI', 'FLOORSMAX_MEDI', 'FLOORSMIN_MEDI', 'LANDAREA_MEDI', 'LIVINGAPARTMENTS_MEDI', 'LIVINGAREA_MEDI', 'NONLIVINGAPARTMENTS_MEDI', 'NONLIVINGAREA_MEDI', 'FONDKAPREMONT_MODE', 'HOUSETYPE_MODE', 'TOTALAREA_MODE', 'WALLSMATERIAL_MODE', 'EMERGENCYSTATE_MODE', 'OBS_30_CNT_SOCIAL_CIRCLE', 'DEF_30_CNT_SOCIAL_CIRCLE', 'OBS_60_CNT_SOCIAL_CIRCLE', 'DEF_60_CNT_SOCIAL_CIRCLE', 'DAYS_LAST_PHONE_CHANGE', 
'FLAG_DOCUMENT_2', 'FLAG_DOCUMENT_3', 'FLAG_DOCUMENT_4', 'FLAG_DOCUMENT_5', 'FLAG_DOCUMENT_6', 'FLAG_DOCUMENT_7', 'FLAG_DOCUMENT_8', 'FLAG_DOCUMENT_9', 'FLAG_DOCUMENT_10', 'FLAG_DOCUMENT_11', 'FLAG_DOCUMENT_12', 'FLAG_DOCUMENT_13', 'FLAG_DOCUMENT_14', 'FLAG_DOCUMENT_15', 'FLAG_DOCUMENT_16', 'FLAG_DOCUMENT_17', 'FLAG_DOCUMENT_18', 'FLAG_DOCUMENT_19', 'FLAG_DOCUMENT_20', 'FLAG_DOCUMENT_21', 'AMT_REQ_CREDIT_BUREAU_HOUR', 'AMT_REQ_CREDIT_BUREAU_DAY', 'AMT_REQ_CREDIT_BUREAU_WEEK', 'AMT_REQ_CREDIT_BUREAU_MON', 'AMT_REQ_CREDIT_BUREAU_QRT', 'AMT_REQ_CREDIT_BUREAU_YEAR', 'ef_INCOME_CREDIT_PERCENT', 'ef_ANN_INCOME_PERCENT', 'ef_GOODS_PRICE_PERCENT', 'ef_CNT_NON_CHILDREN', 'ef_LIVINGAREA_LANDAREA_AVG_RATIO']
To get a baseline, we will use some of the features after they have been preprocessed through the pipeline. The baseline model is a logistic regression model.
def pct(x):
    return round(100*x,3)
try:
    expLog
except NameError:
    expLog = pd.DataFrame(columns=["exp_name",
                                   "Train Acc",
                                   "Valid Acc",
                                   "Test Acc",
                                   "Train AUC",
                                   "Valid AUC",
                                   "Test AUC",
                                   "Train Precision",
                                   "Valid Precision",
                                   "Test Precision",
                                   "Train Recall",
                                   "Valid Recall",
                                   "Test Recall",
                                   "Train Log Loss",
                                   "Valid Log Loss",
                                   "Test Log Loss",
                                   "P Score",
                                   "Train RMSE",
                                   "Valid RMSE",
                                   "Test RMSE",
                                   "Train MAE",
                                   "Valid MAE",
                                   "Test MAE",
                                   "Train Time",
                                   "Valid Time",
                                   "Test Time",
                                   "Description"
                                   ])
appsTrainDF_agg.columns.to_list()
['SK_ID_CURR', 'TARGET', 'NAME_CONTRACT_TYPE', 'CODE_GENDER', 'FLAG_OWN_CAR', 'FLAG_OWN_REALTY', 'CNT_CHILDREN', 'AMT_INCOME_TOTAL', 'AMT_CREDIT', 'AMT_ANNUITY', 'AMT_GOODS_PRICE', 'NAME_TYPE_SUITE', 'NAME_INCOME_TYPE', 'NAME_EDUCATION_TYPE', 'NAME_FAMILY_STATUS', 'NAME_HOUSING_TYPE', 'REGION_POPULATION_RELATIVE', 'DAYS_BIRTH', 'DAYS_EMPLOYED', 'DAYS_REGISTRATION', 'DAYS_ID_PUBLISH', 'OWN_CAR_AGE', 'FLAG_MOBIL', 'FLAG_EMP_PHONE', 'FLAG_WORK_PHONE', 'FLAG_CONT_MOBILE', 'FLAG_PHONE', 'FLAG_EMAIL', 'OCCUPATION_TYPE', 'CNT_FAM_MEMBERS', 'REGION_RATING_CLIENT', 'REGION_RATING_CLIENT_W_CITY', 'WEEKDAY_APPR_PROCESS_START', 'HOUR_APPR_PROCESS_START', 'REG_REGION_NOT_LIVE_REGION', 'REG_REGION_NOT_WORK_REGION', 'LIVE_REGION_NOT_WORK_REGION', 'REG_CITY_NOT_LIVE_CITY', 'REG_CITY_NOT_WORK_CITY', 'LIVE_CITY_NOT_WORK_CITY', 'ORGANIZATION_TYPE', 'EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3', 'APARTMENTS_AVG', 'BASEMENTAREA_AVG', 'YEARS_BEGINEXPLUATATION_AVG', 'YEARS_BUILD_AVG', 'COMMONAREA_AVG', 'ELEVATORS_AVG', 'ENTRANCES_AVG', 'FLOORSMAX_AVG', 'FLOORSMIN_AVG', 'LANDAREA_AVG', 'LIVINGAPARTMENTS_AVG', 'LIVINGAREA_AVG', 'NONLIVINGAPARTMENTS_AVG', 'NONLIVINGAREA_AVG', 'APARTMENTS_MODE', 'BASEMENTAREA_MODE', 'YEARS_BEGINEXPLUATATION_MODE', 'YEARS_BUILD_MODE', 'COMMONAREA_MODE', 'ELEVATORS_MODE', 'ENTRANCES_MODE', 'FLOORSMAX_MODE', 'FLOORSMIN_MODE', 'LANDAREA_MODE', 'LIVINGAPARTMENTS_MODE', 'LIVINGAREA_MODE', 'NONLIVINGAPARTMENTS_MODE', 'NONLIVINGAREA_MODE', 'APARTMENTS_MEDI', 'BASEMENTAREA_MEDI', 'YEARS_BEGINEXPLUATATION_MEDI', 'YEARS_BUILD_MEDI', 'COMMONAREA_MEDI', 'ELEVATORS_MEDI', 'ENTRANCES_MEDI', 'FLOORSMAX_MEDI', 'FLOORSMIN_MEDI', 'LANDAREA_MEDI', 'LIVINGAPARTMENTS_MEDI', 'LIVINGAREA_MEDI', 'NONLIVINGAPARTMENTS_MEDI', 'NONLIVINGAREA_MEDI', 'FONDKAPREMONT_MODE', 'HOUSETYPE_MODE', 'TOTALAREA_MODE', 'WALLSMATERIAL_MODE', 'EMERGENCYSTATE_MODE', 'OBS_30_CNT_SOCIAL_CIRCLE', 'DEF_30_CNT_SOCIAL_CIRCLE', 'OBS_60_CNT_SOCIAL_CIRCLE', 'DEF_60_CNT_SOCIAL_CIRCLE', 'DAYS_LAST_PHONE_CHANGE', 
'FLAG_DOCUMENT_2', 'FLAG_DOCUMENT_3', 'FLAG_DOCUMENT_4', 'FLAG_DOCUMENT_5', 'FLAG_DOCUMENT_6', 'FLAG_DOCUMENT_7', 'FLAG_DOCUMENT_8', 'FLAG_DOCUMENT_9', 'FLAG_DOCUMENT_10', 'FLAG_DOCUMENT_11', 'FLAG_DOCUMENT_12', 'FLAG_DOCUMENT_13', 'FLAG_DOCUMENT_14', 'FLAG_DOCUMENT_15', 'FLAG_DOCUMENT_16', 'FLAG_DOCUMENT_17', 'FLAG_DOCUMENT_18', 'FLAG_DOCUMENT_19', 'FLAG_DOCUMENT_20', 'FLAG_DOCUMENT_21', 'AMT_REQ_CREDIT_BUREAU_HOUR', 'AMT_REQ_CREDIT_BUREAU_DAY', 'AMT_REQ_CREDIT_BUREAU_WEEK', 'AMT_REQ_CREDIT_BUREAU_MON', 'AMT_REQ_CREDIT_BUREAU_QRT', 'AMT_REQ_CREDIT_BUREAU_YEAR', 'ef_INCOME_CREDIT_PERCENT', 'ef_ANN_INCOME_PERCENT', 'ef_GOODS_PRICE_PERCENT', 'ef_CNT_NON_CHILDREN', 'ef_LIVINGAREA_LANDAREA_AVG_RATIO', 'prevApps_AMT_ANNUITY_min', 'prevApps_AMT_ANNUITY_max', 'prevApps_AMT_ANNUITY_mean', 'prevApps_AMT_APPLICATION_min', 'prevApps_AMT_APPLICATION_max', 'prevApps_AMT_APPLICATION_mean', 'prevApps_AMT_DOWN_PAYMENT_min', 'prevApps_AMT_DOWN_PAYMENT_max', 'prevApps_AMT_DOWN_PAYMENT_mean', 'prevApps_CNT_PAYMENT_min', 'prevApps_CNT_PAYMENT_max', 'prevApps_CNT_PAYMENT_mean', 'prevApps_RATE_INTEREST_PRIVILEGED_min', 'prevApps_RATE_INTEREST_PRIVILEGED_max', 'prevApps_RATE_INTEREST_PRIVILEGED_mean', 'prevApps_AMT_CREDIT_min', 'prevApps_AMT_CREDIT_max', 'prevApps_AMT_CREDIT_mean', 'prevApps_DAYS_FIRST_DRAWING_min', 'prevApps_DAYS_FIRST_DRAWING_max', 'prevApps_DAYS_FIRST_DRAWING_mean', 'prevApps_DAYS_LAST_DUE_min', 'prevApps_DAYS_LAST_DUE_max', 'prevApps_DAYS_LAST_DUE_mean', 'prevApps_DAYS_FIRST_DUE_min', 'prevApps_DAYS_FIRST_DUE_max', 'prevApps_DAYS_FIRST_DUE_mean', 'ef_prevApps_AMT_ANNUITY_range', 'ef_prevApps_AMT_APPLICATION_range', 'ef_prevApps_AMT_DOWN_PAYMENT_range', 'bureau_AMT_ANNUITY_min', 'bureau_AMT_ANNUITY_max', 'bureau_AMT_ANNUITY_mean', 'bureau_AMT_CREDIT_SUM_min', 'bureau_AMT_CREDIT_SUM_max', 'bureau_AMT_CREDIT_SUM_mean', 'bureau_DAYS_CREDIT_min', 'bureau_DAYS_CREDIT_max', 'bureau_DAYS_CREDIT_mean', 'bureau_AMT_CREDIT_SUM_OVERDUE_min', 'bureau_AMT_CREDIT_SUM_OVERDUE_max', 
'bureau_AMT_CREDIT_SUM_OVERDUE_mean', 'bureau_CREDIT_DAY_OVERDUE_min', 'bureau_CREDIT_DAY_OVERDUE_max', 'bureau_CREDIT_DAY_OVERDUE_mean', 'bureau_AMT_CREDIT_SUM_DEBT_min', 'bureau_AMT_CREDIT_SUM_DEBT_max', 'bureau_AMT_CREDIT_SUM_DEBT_mean', 'bureau_AMT_CREDIT_SUM_LIMIT_min', 'bureau_AMT_CREDIT_SUM_LIMIT_max', 'bureau_AMT_CREDIT_SUM_LIMIT_mean', 'bureau_AMT_CREDIT_MAX_OVERDUE_min', 'bureau_AMT_CREDIT_MAX_OVERDUE_max', 'bureau_AMT_CREDIT_MAX_OVERDUE_mean', 'ef_bureau_AMT_DEBT_CREDIT_RATIO', 'ef_bureau_AMT_OVERDUE_CREDIT_RATIO', 'ef_bureau_AMT_CREDIT_SUM_range', 'ef_bureau_AMT_CREDIT_SUM_DEBT_range', 'ef_bureau_AMT_CREDIT_SUM_OVERDUE_range', 'credit_card_balance_MONTHS_BALANCE_min', 'credit_card_balance_MONTHS_BALANCE_max', 'credit_card_balance_MONTHS_BALANCE_mean', 'credit_card_balance_AMT_BALANCE_min', 'credit_card_balance_AMT_BALANCE_max', 'credit_card_balance_AMT_BALANCE_mean', 'credit_card_balance_CNT_INSTALMENT_MATURE_CUM_min', 'credit_card_balance_CNT_INSTALMENT_MATURE_CUM_max', 'credit_card_balance_CNT_INSTALMENT_MATURE_CUM_mean', 'credit_card_balance_AMT_DRAWINGS_ATM_CURRENT_min', 'credit_card_balance_AMT_DRAWINGS_ATM_CURRENT_max', 'credit_card_balance_AMT_DRAWINGS_ATM_CURRENT_mean', 'credit_card_balance_AMT_INST_MIN_REGULARITY_min', 'credit_card_balance_AMT_INST_MIN_REGULARITY_max', 'credit_card_balance_AMT_INST_MIN_REGULARITY_mean', 'credit_card_balance_AMT_PAYMENT_TOTAL_CURRENT_min', 'credit_card_balance_AMT_PAYMENT_TOTAL_CURRENT_max', 'credit_card_balance_AMT_PAYMENT_TOTAL_CURRENT_mean', 'credit_card_balance_CNT_DRAWINGS_ATM_CURRENT_min', 'credit_card_balance_CNT_DRAWINGS_ATM_CURRENT_max', 'credit_card_balance_CNT_DRAWINGS_ATM_CURRENT_mean', 'credit_card_balance_AMT_CREDIT_LIMIT_ACTUAL_min', 'credit_card_balance_AMT_CREDIT_LIMIT_ACTUAL_max', 'credit_card_balance_AMT_CREDIT_LIMIT_ACTUAL_mean', 'credit_card_balance_AMT_RECIVABLE_min', 'credit_card_balance_AMT_RECIVABLE_max', 'credit_card_balance_AMT_RECIVABLE_mean', 
'credit_card_balance_AMT_TOTAL_RECEIVABLE_min', 'credit_card_balance_AMT_TOTAL_RECEIVABLE_max', 'credit_card_balance_AMT_TOTAL_RECEIVABLE_mean', 'credit_card_balance_AMT_RECEIVABLE_PRINCIPAL_min', 'credit_card_balance_AMT_RECEIVABLE_PRINCIPAL_max', 'credit_card_balance_AMT_RECEIVABLE_PRINCIPAL_mean', 'installments_pmnts_AMT_INSTALMENT_min', 'installments_pmnts_AMT_INSTALMENT_max', 'installments_pmnts_AMT_INSTALMENT_mean', 'installments_pmnts_AMT_PAYMENT_min', 'installments_pmnts_AMT_PAYMENT_max', 'installments_pmnts_AMT_PAYMENT_mean', 'installments_pmnts_DAYS_ENTRY_PAYMENT_min', 'installments_pmnts_DAYS_ENTRY_PAYMENT_max', 'installments_pmnts_DAYS_ENTRY_PAYMENT_mean', 'installments_pmnts_DAYS_INSTALMENT_min', 'installments_pmnts_DAYS_INSTALMENT_max', 'installments_pmnts_DAYS_INSTALMENT_mean', 'installments_pmnts_NUM_INSTALMENT_VERSION_min', 'installments_pmnts_NUM_INSTALMENT_VERSION_max', 'installments_pmnts_NUM_INSTALMENT_VERSION_mean', 'pos_cash_balance_CNT_INSTALMENT_FUTURE_min', 'pos_cash_balance_CNT_INSTALMENT_FUTURE_max', 'pos_cash_balance_CNT_INSTALMENT_FUTURE_mean', 'pos_cash_balance_MONTHS_BALANCE_min', 'pos_cash_balance_MONTHS_BALANCE_max', 'pos_cash_balance_MONTHS_BALANCE_mean', 'pos_cash_balance_SK_DPD_DEF_min', 'pos_cash_balance_SK_DPD_DEF_max', 'pos_cash_balance_SK_DPD_DEF_mean']
train_dataset = appsTrainDF_agg
train_dataset.shape
(307511, 243)
# Split the data into `splits` chunks and feed only the first chunk to the pipeline,
# so the working sample is (1 / splits) the size of the full dataset
splits = 3
# Fraction of the working sample held out as the test set
subsample_rate = 0.3
finaldf = np.array_split(train_dataset, splits)
X_train = finaldf[0][selected_features]
y_train = finaldf[0]['TARGET']
X_kaggle_test= X_kaggle_test[selected_features]
## split part of data
X_train, X_test, y_train, y_test = train_test_split(X_train, y_train, stratify=y_train,
test_size=subsample_rate, random_state=42)
X_train, X_valid, y_train, y_valid = train_test_split(X_train, y_train,stratify=y_train,test_size=0.15, random_state=42)
print(f"X train shape: {X_train.shape}")
print(f"X validation shape: {X_valid.shape}")
print(f"X test shape: {X_test.shape}")
print(f"X X_kaggle_test shape: {X_kaggle_test.shape}")
X train shape: (60989, 91)
X validation shape: (10763, 91)
X test shape: (30752, 91)
X X_kaggle_test shape: (48744, 91)
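The row counts printed above follow from the split settings. The sketch below reproduces the arithmetic in plain Python. It assumes that `np.array_split` gives the leading chunks one extra row when the division is uneven and that `train_test_split` rounds a fractional `test_size` up to a whole number of rows; the variable names are illustrative only.

```python
import math

n_rows = 307511        # rows in train_dataset, from the shape shown earlier
splits = 3
subsample_rate = 0.3   # test fraction of the working sample
valid_rate = 0.15      # validation fraction of what remains after the test split

# The first (n_rows % splits) chunks receive one extra row.
first_chunk = n_rows // splits + (1 if n_rows % splits > 0 else 0)

# A fractional test_size is rounded up to a whole row count.
n_test = math.ceil(subsample_rate * first_chunk)
n_valid = math.ceil(valid_rate * (first_chunk - n_test))
n_train = first_chunk - n_test - n_valid

print(first_chunk, n_train, n_valid, n_test)
```

Only about a third of the available rows end up in the working sample, and roughly 60% of that sample is used for fitting.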
%%time
np.random.seed(42)
full_pipeline_with_predictor = Pipeline([
("preparation", data_prep_pipeline),
("logistic", LogisticRegression())
])
CPU times: user 6.45 ms, sys: 171 µs, total: 6.62 ms Wall time: 5.58 ms
Use five shuffled splits of the training data (30% held out in each) to perform cross-validation
# Import Model selection libraries
from sklearn.model_selection import ShuffleSplit
from sklearn.model_selection import cross_validate
# Use ShuffleSplit() with 5 splits, 30% test_size and random state of 0
cvSplits = ShuffleSplit(n_splits=5, test_size=0.3, random_state=0)
from time import time, ctime
from sklearn.metrics import log_loss, make_scorer
start = time()
model = full_pipeline_with_predictor.fit(X_train, y_train)
np.random.seed(42)
# Set up cross validation scores
logit_scores = cross_validate(model, X_train, y_train, cv=cvSplits, scoring=make_scorer(log_loss), return_train_score=True, n_jobs=-1)
train_time = np.round(time() - start, 4)
# Time and score valid predictions
start = time()
logit_score_valid = full_pipeline_with_predictor.score(X_valid, y_valid)
valid_time = np.round(time() - start, 4)
# Time and score test predictions
start = time()
logit_score_test = full_pipeline_with_predictor.score(X_test, y_test)
test_time = np.round(time() - start, 4)
logit_scores
{'fit_time': array([2.7547009 , 2.59689641, 2.81942034, 2.9246428 , 2.82481122]),
'score_time': array([0.22927523, 0.1289053 , 0.12815261, 0.13114214, 0.23704362]),
'test_score': array([2.87492912, 2.80508556, 2.80886073, 2.7503427 , 2.85416514]),
'train_score': array([2.7692843 , 2.80326314, 2.79679103, 2.82429781, 2.78465577])}
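A note on reading these values: `make_scorer(log_loss)` with its default settings scores the output of `predict`, i.e. hard 0/1 labels, not probabilities. With hard labels every misclassified sample contributes the maximum clipped penalty, so the result is roughly the error rate times a constant. The sketch below illustrates this in plain Python on made-up numbers; it assumes a clipping value of 1e-15, which is what older scikit-learn releases used, and `binary_log_loss` is a hypothetical helper, not a library function.

```python
import math

def binary_log_loss(y_true, p_pred, eps=1e-15):
    """Mean negative log-likelihood, with probabilities clipped to [eps, 1 - eps]."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

# Toy labels with 10% positives and some predicted probabilities (illustrative values).
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
proba = [0.05, 0.10, 0.08, 0.02, 0.20, 0.07, 0.03, 0.15, 0.04, 0.30]
hard = [1 if p >= 0.5 else 0 for p in proba]  # thresholded labels: all zeros here

loss_proba = binary_log_loss(y_true, proba)   # log loss on probabilities
loss_hard = binary_log_loss(y_true, hard)     # log loss on hard labels
error_rate = sum(int(h != y) for h, y in zip(hard, y_true)) / len(y_true)

print(loss_proba, loss_hard, error_rate * -math.log(1e-15))
```

Under that assumption, a log loss near 2.8 is consistent with an error rate of about 8%, which matches the accuracy of roughly 0.92 reported for the baseline. To score probabilities instead, the built-in `'neg_log_loss'` scoring string is one option.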
Submissions are evaluated on area under the ROC curve between the predicted probability and the observed target.
The scikit-learn roc_auc_score function computes the area under the receiver operating characteristic (ROC) curve, which is also denoted by AUC or AUROC. Computing the area under the ROC curve summarizes the curve in a single number.
from sklearn.metrics import roc_auc_score
>>> y_true = np.array([0, 0, 1, 1])
>>> y_scores = np.array([0.1, 0.4, 0.35, 0.8])
>>> roc_auc_score(y_true, y_scores)
0.75
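One way to see where 0.75 comes from: AUC equals the fraction of (positive, negative) pairs in which the positive sample gets the higher score, with ties counted as half. The sketch below recomputes the example by hand in plain Python; `pairwise_auc` is a hypothetical helper written for illustration, not part of scikit-learn.

```python
def pairwise_auc(y_true, y_scores):
    """AUC as the share of positive/negative pairs ranked correctly (ties count 0.5)."""
    pos = [s for y, s in zip(y_true, y_scores) if y == 1]
    neg = [s for y, s in zip(y_true, y_scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Same data as the roc_auc_score example above.
auc = pairwise_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(auc)
```

Of the four pairs, only (0.35, 0.4) is ranked the wrong way round, so three out of four pairs are correct.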
from sklearn.metrics import mean_absolute_error, mean_squared_error, accuracy_score, roc_auc_score
from sklearn.metrics import precision_score, recall_score
from scipy import stats
pd.set_option("display.max_rows", None, "display.max_columns", None)
y_train_pred = model.predict(X_train)
y_valid_pred = model.predict(X_valid)
y_test_pred = model.predict(X_test)
y_train_pred_prob = model.predict_proba(X_train)[:, 1]
y_valid_pred_prob = model.predict_proba(X_valid)[:, 1]
y_test_pred_prob = model.predict_proba(X_test)[:, 1]
exp_name = f"Baseline_{len(selected_features)}_features"
expLog.loc[len(expLog)] = [f"{exp_name}"] + list(np.round(
[accuracy_score(y_train, y_train_pred),
accuracy_score(y_valid, y_valid_pred),
accuracy_score(y_test, y_test_pred),
roc_auc_score(y_train, y_train_pred_prob),
roc_auc_score(y_valid, y_valid_pred_prob),
roc_auc_score(y_test, y_test_pred_prob),
precision_score(y_train, y_train_pred),
precision_score(y_valid, y_valid_pred),
precision_score(y_test, y_test_pred),
recall_score(y_train, y_train_pred),
recall_score(y_valid, y_valid_pred),
recall_score(y_test, y_test_pred),
logit_scores['train_score'].mean(),
logit_scores['test_score'].mean(),
log_loss(y_test, model.predict(X_test)),
0, # p-value not relevant for Baseline model
np.round(np.sqrt(mean_squared_error(y_train, y_train_pred_prob)), 3),
np.round(np.sqrt(mean_squared_error(y_valid, y_valid_pred_prob)), 3),
np.round(np.sqrt(mean_squared_error(y_test, y_test_pred_prob)), 3),
np.round(mean_absolute_error(y_train, y_train_pred_prob), 3),
np.round(mean_absolute_error(y_valid, y_valid_pred_prob), 3),
np.round(mean_absolute_error(y_test, y_test_pred_prob), 3)], 4)) \
+ [train_time, valid_time, test_time] + [f"Baseline LR {len(selected_features)}"]
expLog
| | exp_name | Train Acc | Valid Acc | Test Acc | Train AUC | Valid AUC | Test AUC | Train Precision | Valid Precision | Test Precision | Train Recall | Valid Recall | Test Recall | Train Log Loss | Valid Log Loss | Test Log Loss | P Score | Train RMSE | Valid RMSE | Test RMSE | Train MAE | Valid MAE | Test MAE | Train Time | Valid Time | Test Time | Description |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Baseline_91_features | 0.9188 | 0.9191 | 0.9182 | 0.7615 | 0.7504 | 0.7567 | 0.4976 | 0.56 | 0.3895 | 0.0206 | 0.016 | 0.0148 | 2.7957 | 2.8187 | 2.8269 | 0.0 | 0.261 | 0.261 | 0.262 | 0.136 | 0.137 | 0.136 | 8.4338 | 0.2404 | 0.4999 | Baseline LR 91 |
# roc curve for each model
fprs, tprs, names, scores, cvscores, pvalues, accuracy, cnfmatrix = list(), list(), list(), list(), list(), list(), list(), list()
features_list, final_best_clf, results = {}, {}, []
from sklearn.metrics import confusion_matrix, plot_confusion_matrix
def confusion_matrix_def(model, X_train, y_train, X_test, y_test, X_valid, y_valid, cnfmatrix):
    # Prediction
    preds_test = model.predict(X_test)
    preds_train = model.predict(X_train)
    preds_valid = model.predict(X_valid)
    # Row-normalize each matrix so every true class sums to 1
    cm_train = confusion_matrix(y_train, preds_train).astype(np.float32)
    #print(cm_train)
    cm_train /= cm_train.sum(axis=1)[:, np.newaxis]
    cm_test = confusion_matrix(y_test, preds_test).astype(np.float32)
    #print(cm_test)
    cm_test /= cm_test.sum(axis=1)[:, np.newaxis]
    cm_valid = confusion_matrix(y_valid, preds_valid).astype(np.float32)
    cm_valid /= cm_valid.sum(axis=1)[:, np.newaxis]
    #class_labels = ['No Default','Default']
    plt.figure(figsize=(24, 8))
    plt.subplot(131)
    g = sns.heatmap(cm_train, vmin=0, vmax=1, annot=True, cmap="Reds")
    plt.xlabel("Predicted", fontsize=14)
    plt.ylabel("True", fontsize=14)
    g.set(xticklabels=class_labels, yticklabels=class_labels)
    plt.title("Train", fontsize=14)
    plt.subplot(132)
    g = sns.heatmap(cm_test, vmin=0, vmax=1, annot=True, cmap="Reds")
    plt.xlabel("Predicted", fontsize=14)
    plt.ylabel("True", fontsize=14)
    g.set(xticklabels=class_labels, yticklabels=class_labels)
    plt.title("Test", fontsize=14)
    plt.subplot(133)
    g = sns.heatmap(cm_valid, vmin=0, vmax=1, annot=True, cmap="Reds")
    plt.xlabel("Predicted", fontsize=14)
    plt.ylabel("True", fontsize=14)
    g.set(xticklabels=class_labels, yticklabels=class_labels)
    plt.title("Validation", fontsize=14)
    # Return the list unchanged so callers that reassign it do not end up with None
    return cnfmatrix
# Confusion matrix for baseline model
confusion_matrix_def(model, X_train, y_train, X_test, y_test, X_valid, y_valid, cnfmatrix)
plt.show()
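The heatmaps show row-normalized matrices: each cell is divided by the number of samples in its true class, so the diagonal reads as per-class recall. Below is a minimal sketch of that normalization step on made-up counts (the numbers are illustrative, not the baseline's actual confusion matrix).

```python
import numpy as np

# Hypothetical counts: rows are true classes, columns are predicted classes.
cm = np.array([[90, 10],
               [6, 4]], dtype=np.float32)

# Divide each row by its total; [:, np.newaxis] turns the row sums into a
# column vector so broadcasting applies one divisor per row.
cm_norm = cm / cm.sum(axis=1)[:, np.newaxis]

print(cm_norm)
```

With heavily imbalanced classes this view is easier to read than raw counts, because a low value in the bottom-right cell makes poor recall on the minority class visible at a glance.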
from sklearn.metrics import roc_curve, plot_roc_curve
def roc_curve_plot(model, X_train, y_train, X_test, y_test, X_valid, y_valid, fprs, tprs, name):
    fpr, tpr, threshold = roc_curve(y_test, model.predict_proba(X_test)[:, 1])
    fprs.append(fpr)
    tprs.append(tpr)
    # plot combined ROC curve for train, valid, test
    train_roc_plot = plot_roc_curve(model, X_train, y_train, name="TrainRocAuc")
    test_roc_plot = plot_roc_curve(model, X_test, y_test, name="TestRocAuc", ax=train_roc_plot.ax_)
    valid_roc_plot = plot_roc_curve(model, X_valid, y_valid, name="ValidRocAuc", ax=test_roc_plot.ax_)
    valid_roc_plot.ax_.set_title("ROC Curve Comparison - " + name)
    plt.legend(bbox_to_anchor=(1.04,1), loc="upper left", borderaxespad=0)
    plt.show()
    return fprs, tprs
_,_ = roc_curve_plot(model, X_train, y_train, X_test, y_test, X_valid, y_valid
, fprs, tprs, "Baseline Logistic Regression Model")
The baseline Logistic Regression model was tuned across different parameters evaluated for the following metrics:
import json
from sklearn.naive_bayes import GaussianNB
# from sklearn.svm import SVC # Not implementing due to technical constraints
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from xgboost import XGBClassifier
classifiers = [
[('Logistic Regression', LogisticRegression(random_state=42))],
[('Naive Bayes', GaussianNB())],
# [('Support Vector', SVC(random_state=42,probability=True),"SVM")],
[('Gradient Boosting', GradientBoostingClassifier(random_state=42))],
[('XGBoost', XGBClassifier(random_state=42))],
[('DecisionTrees', DecisionTreeClassifier(random_state=42))],
[('RandomForest', RandomForestClassifier(random_state=42))]
]
params_grid = {
'Logistic Regression': {
'penalty': ('l1', 'l2'),
'tol': [0.0001],
'C': (0.01, 0.001, 0.0001),
},
'Naive Bayes': {
'var_smoothing': [1e-8, 1e-9, 1e-10]
},
# 'Support Vector' : {
# 'kernel': ('rbf','poly'),
# 'degree': (4, 5),
# 'C': ( 0.001, 0.01), #Low C - allow for misclassification
# 'gamma':(0.01,0.1,1) #Low gamma - high variance and low bias
# },
'Gradient Boosting': {
'max_depth': [5,10],
'max_features': [10,15],
'n_iter_no_change': [5],
'tol': (0.001, 0.0001),
'n_estimators': [500],
'subsample': [0.85],
'min_samples_leaf' : [3,5]
},
'XGBoost': {
'max_depth': [3,5], # Lower helps with overfitting
'n_estimators': [1000],
'objective': ['binary:logistic'],
'eta' : [0.01,0.1],
# 'colsample_bytree' : [0.2,0.5],
},
'DecisionTrees' : {
'criterion': ['gini','entropy'],
'max_depth': range(1,5),
'min_samples_leaf': range(1,5)
},
'RandomForest': {
'max_depth': [5,10],
'max_features': [10,15],
'min_samples_leaf': [3],
'min_impurity_decrease': [1e-3,1e-4,1e-6],
'n_estimators': [1000]
}
}
Import necessary libraries to determine feature importance for different classifiers:
def runGridSearch(classifiers, cnfmatrix, fprs, tprs):
for (name, classifier) in classifiers:
# Print classifier and parameters
print('****** START', name,'*****')
parameters = params_grid[name]
print("Parameters:")
for p in sorted(parameters.keys()):
print("\t"+str(p)+": "+ str(parameters[p]))
# generate the pipeline for each classifier
full_pipeline_with_predictor = Pipeline([
("preparation", data_prep_pipeline),
("predictor", classifier)
])
# Execute the grid search
params = {}
for p in parameters.keys():
pipe_key = 'predictor__'+str(p)
params[pipe_key] = parameters[p]
grid_search = GridSearchCV(full_pipeline_with_predictor, params, cv=cvSplits, scoring='roc_auc',
n_jobs=-1,verbose=1)
grid_search.fit(X_train, y_train)
# Best estimator score
best_train = pct(grid_search.best_score_)
# Best train scores
print("Cross validation with best estimator")
best_train_scores = cross_validate(grid_search.best_estimator_, X_train, y_train,cv=cvSplits,scoring=make_scorer(log_loss),
return_train_score=True, n_jobs=-1)
#get all scores
# best_train_accuracy = np.round(best_train_scores['train_accuracy'].mean(),4)
best_train_logloss = np.round(best_train_scores['train_score'].mean(),4)
# best_train_roc_auc = np.round(best_train_scores['train_roc_auc'].mean(),4)
valid_time = np.round(best_train_scores['score_time'].mean(),4)
# best_valid_accuracy = np.round(best_train_scores['test_accuracy'].mean(),4)
best_valid_logloss = np.round(best_train_scores['test_score'].mean(),4)
# best_valid_roc_auc = np.round(best_train_scores['test_roc_auc'].mean(),4)
# Conduct t-test with baseline logit (control) and best estimator (experiment)
(t_stat, p_value) = stats.ttest_rel(logit_scores['train_score'], best_train_scores['train_score'])
#test and Prediction with whole data
# Best estimator fitting time
print("Fit and Prediction with best estimator")
start = time()
model = grid_search.best_estimator_.fit(X_train, y_train)
train_time = round(time() - start, 4)
# Best estimator prediction time
start = time()
y_test_pred = model.predict(X_test)
test_time = round(time() - start, 4)
scores.append(roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))
accuracy.append(accuracy_score(y_test, y_test_pred))
# Create confusion matrix for the best model
cnfmatrix = confusion_matrix_def(model, X_train, y_train, X_test, y_test, X_valid, y_valid, cnfmatrix)
# Create AUC ROC curve
fprs, tprs = roc_curve_plot(model, X_train, y_train, X_test, y_test, X_valid, y_valid, fprs, tprs, name)
#Best Model
final_best_clf[name] = pd.DataFrame([{'label': grid_search.best_estimator_.named_steps['predictor'].__class__.__name__,
'predictor': grid_search.best_estimator_.named_steps['predictor']}])
# #Feature importance
# feature_name = num_attribs + list(grid_search.best_estimator_.named_steps['preparation'].transformers[1][1].named_steps['selector'].attribute_names)
# feature_list = feature_name
#append all results
results.append(accuracy_score(y_train, model.predict(X_train)))
names.append(name)
print("Best Parameters:")
best_parameters = grid_search.best_estimator_.get_params()
param_dump = []
for param_name in sorted(params.keys()):
param_dump.append((param_name, best_parameters[param_name]))
print("\t"+str(param_name)+": " + str(best_parameters[param_name]))
print("****** FINISH",name," *****")
print("")
# Record the results
y_train_pred = model.predict(X_train)
y_valid_pred = model.predict(X_valid)
y_test_pred = model.predict(X_test)
y_train_pred_prob = model.predict_proba(X_train)[:, 1]
y_valid_pred_prob = model.predict_proba(X_valid)[:, 1]
y_test_pred_prob = model.predict_proba(X_test)[:, 1]
exp_name = name
expLog.loc[len(expLog)] = [f"{exp_name}"] + list(np.round(
[accuracy_score(y_train, y_train_pred),
accuracy_score(y_valid, y_valid_pred),
accuracy_score(y_test, y_test_pred),
roc_auc_score(y_train, y_train_pred_prob),
roc_auc_score(y_valid, y_valid_pred_prob),
roc_auc_score(y_test, y_test_pred_prob),
precision_score(y_train, y_train_pred),
precision_score(y_valid, y_valid_pred),
precision_score(y_test, y_test_pred),
recall_score(y_train, y_train_pred),
recall_score(y_valid, y_valid_pred),
recall_score(y_test, y_test_pred),
best_train_logloss,
best_valid_logloss,
# NOTE: log_loss receives hard class labels here; pass y_test_pred_prob to score probabilities
log_loss(y_test, y_test_pred),
p_value,
np.round(np.sqrt(mean_squared_error(y_train, y_train_pred_prob)), 3),
np.round(np.sqrt(mean_squared_error(y_valid, y_valid_pred_prob)), 3),
np.round(np.sqrt(mean_squared_error(y_test, y_test_pred_prob)), 3),
np.round(mean_absolute_error(y_train, y_train_pred_prob), 3),
np.round(mean_absolute_error(y_valid, y_valid_pred_prob), 3),
np.round(mean_absolute_error(y_test, y_test_pred_prob), 3)], 4)) \
+ [train_time,valid_time,test_time] \
+ [json.dumps(param_dump)]
runGridSearch(classifiers[0], cnfmatrix, fprs, tprs)
****** START Logistic Regression *****
Parameters:
C: (0.01, 0.001, 0.0001)
penalty: ('l1', 'l2')
tol: [0.0001]
Fitting 5 folds for each of 6 candidates, totalling 30 fits
Cross validation with best estimator
Fit and Prediction with best estimator
Best Parameters:
predictor__C: 0.01
predictor__penalty: l2
predictor__tol: 0.0001
****** FINISH Logistic Regression *****
runGridSearch(classifiers[1], cnfmatrix, fprs, tprs)
****** START Naive Bayes *****
Parameters:
var_smoothing: [1e-08, 1e-09, 1e-10]
Fitting 5 folds for each of 3 candidates, totalling 15 fits
Cross validation with best estimator
Fit and Prediction with best estimator
Best Parameters:
predictor__var_smoothing: 1e-08
****** FINISH Naive Bayes *****
runGridSearch(classifiers[2], cnfmatrix, fprs, tprs)
****** START Gradient Boosting *****
Parameters:
max_depth: [5, 10]
max_features: [10, 15]
min_samples_leaf: [3, 5]
n_estimators: [500]
n_iter_no_change: [5]
subsample: [0.85]
tol: (0.001, 0.0001)
Fitting 5 folds for each of 16 candidates, totalling 80 fits
Cross validation with best estimator
Fit and Prediction with best estimator
Best Parameters:
predictor__max_depth: 5
predictor__max_features: 10
predictor__min_samples_leaf: 5
predictor__n_estimators: 500
predictor__n_iter_no_change: 5
predictor__subsample: 0.85
predictor__tol: 0.0001
****** FINISH Gradient Boosting *****
classifiers[3]
[('XGBoost',
XGBClassifier(base_score=None, booster=None, colsample_bylevel=None,
colsample_bynode=None, colsample_bytree=None, gamma=None,
gpu_id=None, importance_type='gain', interaction_constraints=None,
learning_rate=None, max_delta_step=None, max_depth=None,
min_child_weight=None, missing=nan, monotone_constraints=None,
n_estimators=100, n_jobs=None, num_parallel_tree=None,
random_state=42, reg_alpha=None, reg_lambda=None,
scale_pos_weight=None, subsample=None, tree_method=None,
validate_parameters=None, verbosity=None))]
runGridSearch(classifiers[3], cnfmatrix, fprs, tprs)
****** START XGBoost *****
Parameters:
eta: [0.01, 0.1]
max_depth: [3, 5]
n_estimators: [1000]
objective: ['binary:logistic']
Fitting 5 folds for each of 4 candidates, totalling 20 fits
[18:55:30] WARNING: ../src/learner.cc:1095: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
Cross validation with best estimator
Fit and Prediction with best estimator
[19:04:16] WARNING: ../src/learner.cc:1095: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
Best Parameters:
predictor__eta: 0.01
predictor__max_depth: 5
predictor__n_estimators: 1000
predictor__objective: binary:logistic
****** FINISH XGBoost *****
classifiers[4]
[('DecisionTrees', DecisionTreeClassifier(random_state=42))]
runGridSearch(classifiers[4], cnfmatrix, fprs, tprs)
****** START DecisionTrees *****
Parameters:
criterion: ['gini', 'entropy']
max_depth: range(1, 5)
min_samples_leaf: range(1, 5)
Fitting 5 folds for each of 32 candidates, totalling 160 fits
Cross validation with best estimator
Fit and Prediction with best estimator
Best Parameters:
predictor__criterion: entropy
predictor__max_depth: 4
predictor__min_samples_leaf: 1
****** FINISH DecisionTrees *****
runGridSearch(classifiers[5], cnfmatrix, fprs, tprs)
****** START RandomForest *****
Parameters:
max_depth: [5, 10]
max_features: [10, 15]
min_impurity_decrease: [0.001, 0.0001, 1e-06]
min_samples_leaf: [3]
n_estimators: [1000]
Fitting 5 folds for each of 12 candidates, totalling 60 fits
Cross validation with best estimator
Fit and Prediction with best estimator
Best Parameters:
predictor__max_depth: 10
predictor__max_features: 10
predictor__min_impurity_decrease: 1e-06
predictor__min_samples_leaf: 3
predictor__n_estimators: 1000
****** FINISH RandomForest *****
print('Final experiment results:')
expLog
Final experiment results:
| | exp_name | Train Acc | Valid Acc | Test Acc | Train AUC | Valid AUC | Test AUC | Train Precision | Valid Precision | Test Precision | Train Recall | Valid Recall | Test Recall | Train Log Loss | Valid Log Loss | Test Log Loss | P Score | Train RMSE | Valid RMSE | Test RMSE | Train MAE | Valid MAE | Test MAE | Train Time | Valid Time | Test Time | Description |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Baseline_91_features | 0.9188 | 0.9191 | 0.9182 | 0.7615 | 0.7504 | 0.7567 | 0.4976 | 0.5600 | 0.3895 | 0.0206 | 0.0160 | 0.0148 | 2.7957 | 2.8187 | 2.8269 | 0.0000 | 0.261 | 0.261 | 0.262 | 0.136 | 0.137 | 0.136 | 8.4338 | 0.2404 | 0.4999 | Baseline LR 91 |
| 1 | Logistic Regression | 0.9191 | 0.9188 | 0.9185 | 0.7589 | 0.7490 | 0.7561 | 0.5592 | 0.5000 | 0.4286 | 0.0172 | 0.0114 | 0.0132 | 2.7889 | 2.8107 | 2.8157 | 0.0477 | 0.261 | 0.261 | 0.262 | 0.137 | 0.137 | 0.137 | 2.3704 | 0.1238 | 0.3695 | [["predictor__C", 0.01], ["predictor__penalty"... |
| 2 | Naive Bayes | 0.1960 | 0.1929 | 0.1944 | 0.6583 | 0.6522 | 0.6489 | 0.0867 | 0.0861 | 0.0862 | 0.9340 | 0.9302 | 0.9303 | 27.8160 | 27.8216 | 27.8264 | 0.0000 | 0.887 | 0.889 | 0.888 | 0.801 | 0.804 | 0.802 | 1.2817 | 0.1896 | 0.3890 | [["predictor__var_smoothing", 1e-08]] |
| 3 | Gradient Boosting | 0.9235 | 0.9197 | 0.9185 | 0.8244 | 0.7582 | 0.7580 | 0.8198 | 0.5926 | 0.4636 | 0.0735 | 0.0366 | 0.0280 | 2.6279 | 2.8179 | 2.8157 | 0.0000 | 0.248 | 0.260 | 0.262 | 0.129 | 0.136 | 0.136 | 11.6299 | 0.1884 | 0.4543 | [["predictor__max_depth", 5], ["predictor__max... |
| 4 | XGBoost | 0.9230 | 0.9194 | 0.9190 | 0.8563 | 0.7592 | 0.7619 | 0.9132 | 0.6000 | 0.5200 | 0.0574 | 0.0240 | 0.0208 | 2.6005 | 2.8024 | 2.7989 | 0.0000 | 0.244 | 0.260 | 0.261 | 0.127 | 0.135 | 0.136 | 91.5141 | 0.4896 | 0.4168 | [["predictor__eta", 0.01], ["predictor__max_de... |
| 5 | DecisionTrees | 0.9188 | 0.9188 | 0.9188 | 0.7059 | 0.6966 | 0.6896 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 2.7960 | 2.8183 | 2.8034 | 0.8562 | 0.266 | 0.267 | 0.267 | 0.141 | 0.142 | 0.142 | 2.8795 | 0.1263 | 0.3351 | [["predictor__criterion", "entropy"], ["predic... |
| 6 | RandomForest | 0.9193 | 0.9188 | 0.9188 | 0.8884 | 0.7454 | 0.7502 | 1.0000 | 0.0000 | 0.0000 | 0.0057 | 0.0000 | 0.0000 | 2.7792 | 2.8111 | 2.8034 | 0.0001 | 0.248 | 0.264 | 0.264 | 0.133 | 0.142 | 0.142 | 169.8257 | 2.6308 | 4.1866 | [["predictor__max_depth", 10], ["predictor__ma... |
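The log loss values in the table (about 2.8 for most models, 27.8 for Naive Bayes) look consistent with log loss computed on hard 0/1 predictions rather than on predicted probabilities: in that case each misclassified row contributes roughly -log(eps) and the score is approximately the error rate times that constant. The sketch below illustrates the effect on hypothetical toy data; the helper `binary_log_loss` and the clipping constant `eps=1e-15` are assumptions made for illustration, not code from the notebook.

```python
import math

def binary_log_loss(y_true, p_pred, eps=1e-15):
    """Mean binary cross-entropy; predictions are clipped to [eps, 1 - eps]."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

# Hypothetical toy data: 10 loans, 1 default (label 1).
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
p_prob = [0.05, 0.10, 0.02, 0.08, 0.03, 0.04, 0.06, 0.07, 0.09, 0.30]

# Hard labels at a 0.5 threshold: every row is predicted as class 0.
y_hard = [1 if p >= 0.5 else 0 for p in p_prob]

loss_from_probs = binary_log_loss(y_true, p_prob)    # scores the probabilities
loss_from_labels = binary_log_loss(y_true, y_hard)   # error rate * -log(eps)

print(round(loss_from_probs, 4), round(loss_from_labels, 4))
```

With one error in ten rows, the hard-label score is about 0.1 × 34.54 ≈ 3.45, while the probability-based score stays well below 1. Scoring `predict_proba` output would make the log loss columns sensitive to calibration instead of mirroring accuracy.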
Set-up function to be used for Gradient Boosting and Decision Tree models. Logistic regression has a slightly different logic (which is coded in the section immediately below) so it does not use the function below.
def findFeatImportance(name, classifier):
    # Print classifier and parameters
    print('****** START', name,'*****')
    parameters = params_grid[name]
    print("Parameters:")
    for p in sorted(parameters.keys()):
        print("\t"+str(p)+": "+ str(parameters[p]))
    # generate the pipeline for each classifier
    full_pipeline_with_predictor = Pipeline([
        ("preparation", data_prep_pipeline),
        ("predictor", classifier)
    ])
    # Execute the grid search
    params = {}
    for p in parameters.keys():
        pipe_key = 'predictor__'+str(p)
        params[pipe_key] = parameters[p]
    grid_search = GridSearchCV(full_pipeline_with_predictor, params, cv=cvSplits, scoring='roc_auc',
                               n_jobs=-1,verbose=1)
    grid_search.fit(X_train, y_train)
    # Get importance
    importances = grid_search.best_estimator_.named_steps["predictor"].feature_importances_
    return importances
features = list(X_train.columns)
name = classifiers[0][0][0]
print(name)
classifier = classifiers[0][0][1]
print(classifier)
Logistic Regression
LogisticRegression(random_state=42)
# Print classifier and parameters
print('****** START', name,'*****')
parameters = params_grid[name]
print("Parameters:")
for p in sorted(parameters.keys()):
    print("\t"+str(p)+": "+ str(parameters[p]))
# generate the pipeline for each classifier
full_pipeline_with_predictor = Pipeline([
    ("preparation", data_prep_pipeline),
    ("predictor", classifier)
])
# Execute the grid search
params = {}
for p in parameters.keys():
    pipe_key = 'predictor__'+str(p)
    params[pipe_key] = parameters[p]
grid_search = GridSearchCV(full_pipeline_with_predictor, params, cv=cvSplits, scoring='roc_auc',
                           n_jobs=-1,verbose=1)
grid_search.fit(X_train, y_train)
# Get importance
importances = grid_search.best_estimator_.named_steps["predictor"].coef_[0]
****** START Logistic Regression *****
Parameters:
C: (0.01, 0.001, 0.0001)
penalty: ('l1', 'l2')
tol: [0.0001]
Fitting 5 folds for each of 6 candidates, totalling 30 fits
lr_indices = np.argsort(abs(importances))[::-1]
lr_indices[:11]
array([ 2, 3, 10, 1, 0, 85, 84, 23, 22, 59, 54])
plt.title('Feature Importances - ' + name)
plt.barh(range(11), abs(importances[lr_indices[:11]]), color='b', align='center')
plt.yticks(range(11), [features[i] for i in lr_indices[:11]])
plt.xlabel('Relative Importance')
plt.grid()
plt.show()
name = classifiers[2][0][0]
print(name)
classifier = classifiers[2][0][1]
print(classifier)
Gradient Boosting
GradientBoostingClassifier(random_state=42)
gb_importances = findFeatImportance(name, classifier)
****** START Gradient Boosting *****
Parameters:
max_depth: [5, 10]
max_features: [10, 15]
min_samples_leaf: [3, 5]
n_estimators: [500]
n_iter_no_change: [5]
subsample: [0.85]
tol: (0.001, 0.0001)
Fitting 5 folds for each of 16 candidates, totalling 80 fits
gb_indices = np.argsort(gb_importances)[::-1]
gb_importances[gb_indices[:11]]
array([0.15220906, 0.13524995, 0.05301033, 0.02882706, 0.02499978,
0.02487445, 0.01939031, 0.01784891, 0.01543193, 0.01470172,
0.01434265])
plt.title('Feature Importances - ' + name)
plt.barh(range(11), gb_importances[gb_indices[:11]], color='b', align='center')
plt.yticks(range(11), [features[i] for i in gb_indices[:11]])
plt.xlabel('Relative Importance')
plt.grid()
plt.show()
name = classifiers[3][0][0]
print(name)
classifier = classifiers[3][0][1]
print(classifier)
XGBoost
XGBClassifier(base_score=None, booster=None, colsample_bylevel=None,
colsample_bynode=None, colsample_bytree=None, gamma=None,
gpu_id=None, importance_type='gain', interaction_constraints=None,
learning_rate=None, max_delta_step=None, max_depth=None,
min_child_weight=None, missing=nan, monotone_constraints=None,
n_estimators=100, n_jobs=None, num_parallel_tree=None,
random_state=42, reg_alpha=None, reg_lambda=None,
scale_pos_weight=None, subsample=None, tree_method=None,
validate_parameters=None, verbosity=None)
xg_importances = findFeatImportance(name, classifier)
# Unable to run, taking too long
****** START XGBoost *****
Parameters:
eta: [0.01, 0.1]
max_depth: [3, 5]
n_estimators: [1000]
objective: ['binary:logistic']
Fitting 5 folds for each of 4 candidates, totalling 20 fits
KeyboardInterrupt                         Traceback (most recent call last)
<ipython-input-325-8329bfc4a488> in <module>
----> 1 xg_importances = findFeatImportance(name, classifier)
<ipython-input-300-a11484d422da> in findFeatImportance(name, classifier)
---> 22 grid_search.fit(X_train, y_train)
/usr/local/lib/python3.7/site-packages/sklearn/utils/validation.py in inner_f(*args, **kwargs)
---> 63 return f(*args, **kwargs)
/usr/local/lib/python3.7/site-packages/sklearn/model_selection/_search.py in fit(self, X, y, groups, **fit_params)
--> 841 self._run_search(evaluate_candidates)
/usr/local/lib/python3.7/site-packages/sklearn/model_selection/_search.py in _run_search(self, evaluate_candidates)
-> 1296 evaluate_candidates(ParameterGrid(self.param_grid))
/usr/local/lib/python3.7/site-packages/sklearn/model_selection/_search.py in evaluate_candidates(candidate_params, cv, more_results)
--> 809 enumerate(cv.split(X, y, groups))))
/usr/local/lib/python3.7/site-packages/joblib/parallel.py in __call__(self, iterable)
-> 1054 self.retrieve()
/usr/local/lib/python3.7/site-packages/joblib/parallel.py in retrieve(self)
--> 933 self._output.extend(job.get(timeout=self.timeout))
/usr/local/lib/python3.7/site-packages/joblib/_parallel_backends.py in wrap_future_result(future, timeout)
--> 542 return future.result(timeout=timeout)
/usr/local/lib/python3.7/concurrent/futures/_base.py in result(self, timeout)
--> 430 self._condition.wait(timeout)
/usr/local/lib/python3.7/threading.py in wait(self, timeout)
--> 296 waiter.acquire()
KeyboardInterrupt:
xg_indices = np.argsort(xg_importances)[::-1]
plt.title('Feature Importances - ' + name)
plt.barh(range(7), xg_importances[xg_indices[:7]], color='b', align='center')
plt.yticks(range(7), [features[i] for i in xg_indices[:7]])
plt.xlabel('Relative Importance')
plt.grid()
plt.show()
name = classifiers[4][0][0]
print(name)
classifier = classifiers[4][0][1]
print(classifier)
DecisionTrees
DecisionTreeClassifier(random_state=42)
dt_importances = findFeatImportance(name = classifiers[4][0][0], classifier = classifiers[4][0][1])
****** START DecisionTrees *****
Parameters:
criterion: ['gini', 'entropy']
max_depth: range(1, 5)
min_samples_leaf: range(1, 5)
Fitting 5 folds for each of 32 candidates, totalling 160 fits
dt_indices = np.argsort(dt_importances)[::-1]
plt.title('Feature Importances - ' + name)
plt.barh(range(7), dt_importances[dt_indices[:7]], color='b', align='center')
plt.yticks(range(7), [features[i] for i in dt_indices[:7]])
plt.xlabel('Relative Importance')
plt.grid()
plt.show()
For each SK_ID_CURR in the test set, you must predict a probability for the TARGET variable. The file should contain a header and have the following format:
SK_ID_CURR,TARGET
100001,0.1
100005,0.9
100013,0.2
etc.
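As a minimal sketch of the format above, the snippet below writes a header plus one `SK_ID_CURR,TARGET` row per applicant and checks that each value is a valid probability. The helper name `write_submission` and its checks are assumptions for illustration; the notebook itself builds the file with a pandas DataFrame and `to_csv`.

```python
import csv
import io

def write_submission(ids, probabilities, handle):
    """Write SK_ID_CURR,TARGET rows; reject mismatched lengths and values outside [0, 1]."""
    if len(ids) != len(probabilities):
        raise ValueError("ids and probabilities must have the same length")
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["SK_ID_CURR", "TARGET"])
    for sk_id, p in zip(ids, probabilities):
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"probability out of range for {sk_id}: {p}")
        writer.writerow([sk_id, p])

# Write to an in-memory buffer so the example has no side effects on disk.
buffer = io.StringIO()
write_submission([100001, 100005, 100013], [0.1, 0.9, 0.2], buffer)
submission_text = buffer.getvalue()
print(submission_text)
```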
name = classifiers[2][0][0]
print(name)
classifier = classifiers[2][0][1]
print(classifier)
Gradient Boosting
GradientBoostingClassifier(random_state=42)
# Print classifier and parameters
print('****** START', name,'*****')
parameters = params_grid[name]
print("Parameters:")
for p in sorted(parameters.keys()):
    print("\t"+str(p)+": "+ str(parameters[p]))
# generate the pipeline for each classifier
full_pipeline_with_predictor = Pipeline([
    ("preparation", data_prep_pipeline),
    ("predictor", classifier)
])
# Execute the grid search
params = {}
for p in parameters.keys():
    pipe_key = 'predictor__'+str(p)
    params[pipe_key] = parameters[p]
grid_search = GridSearchCV(full_pipeline_with_predictor, params, cv=cvSplits, scoring='roc_auc',
                           n_jobs=-1,verbose=1)
grid_search.fit(X_train, y_train)
print("Fit and Prediction with best estimator")
model = grid_search.best_estimator_.fit(X_train, y_train)
print('****** END', name,'*****')
****** START Gradient Boosting *****
Parameters:
max_depth: [5, 10]
max_features: [10, 15]
min_samples_leaf: [3, 5]
n_estimators: [500]
n_iter_no_change: [5]
subsample: [0.85]
tol: (0.001, 0.0001)
Fitting 5 folds for each of 16 candidates, totalling 80 fits
Fit and Prediction with best estimator
****** END Gradient Boosting *****
test_class_scores = model.predict_proba(X_kaggle_test)[:, 1]
test_class_scores[0:10]
array([0.08014695, 0.08202941, 0.03076722, 0.02388138, 0.12220071,
0.03362661, 0.02312744, 0.0428502 , 0.02653686, 0.08951652])
# Submission dataframe
submit_df = datasets["application_test"][['SK_ID_CURR']]
submit_df['TARGET'] = test_class_scores
submit_df.head()
| | SK_ID_CURR | TARGET |
|---|---|---|
| 0 | 100001 | 0.080147 |
| 1 | 100005 | 0.082029 |
| 2 | 100013 | 0.030767 |
| 3 | 100028 | 0.023881 |
| 4 | 100038 | 0.122201 |
submit_df.to_csv("submission_P2.csv",index=False)
! kaggle competitions submit -c home-credit-default-risk -f submission_P2.csv -m "Phase 2 XGBoost submission"
100%|███████████████████████████████████████| 1.26M/1.26M [00:02<00:00, 460kB/s] Successfully submitted to Home Credit Default Risk

The HCDR project aims to create a machine learning model that can accurately predict whether a customer will default on loan repayment. In Phase 1, we developed a baseline logistic regression model that achieved a ROC_AUC score of 0.74306.
In Phase 2, we wanted to improve our performance with new features and evaluate other algorithms. We engineered additional features and performed Grid Search with six classification algorithms to tune hyperparameters. XGBoost performed the best, with the highest test accuracy (91.90%), the highest test AUC (76.19%), and the highest test precision (52.00%). Gradient Boosting came very close on accuracy and AUC and had slightly higher test recall (2.80% vs. 2.08%), but lower test precision (46.36%). Naive Bayes performed the worst among all models, with the lowest accuracy at about 19.5% and the highest log loss at 27.8. Decision Trees and Random Forest performed no better than the baseline.
Our best ROC_AUC score for Kaggle submission was 0.74779.
Home Credit is an international non-bank financial institution that aims to lend people money regardless of their credit history. Home Credit Group focuses on providing a positive borrowing experience for customers who do not rely on traditional banking sources. To that end, Home Credit Group published a dataset on Kaggle with the goal of identifying and reducing unfair loan rejections.
The purpose of this project is to create a machine learning model that can accurately predict customer behavior on loan repayment. Our task is to form a pipeline that builds a baseline machine learning model using a logistic regression classification algorithm. The final model will be evaluated using a number of different performance metrics, which we can then use to create a better model. Businesses can use this model to identify whether a loan is at risk of default. The new model will help ensure that clients who are capable of repaying their loans are not rejected, and that loans are given with a principal, maturity, and repayment calendar that allow clients to be successful.
The results of the machine learning pipelines are measured by using these metrics: Mean Absolute Error (MAE), Root Mean Square Error (RMSE), Accuracy Score, Precision, Recall, Confusion Matrix, and Area Under ROC Curve (AUC).
The results of our pipelines will be analyzed and ranked. The most efficient pipeline will be submitted to the Kaggle competition for the Home Credit Default Risk (HCDR).
Workflow
We are implementing the workflow outlined below. In Phase 0, we understood the project modelling requirements and outlined our plans. In Phase 1, we are performing the first among three iterations of the remainder of the workflow.

The dataset contains 1 primary table and 6 secondary tables.

Primary Tables

application_train
This primary table includes the application information for each loan application at Home Credit in one row. This row includes the target variable of whether or not the loan was repaid. We use this field as the basis to determine the feature importance. The target variable is binary, since this is a classification problem.
The target variable takes on two different values:

application_test
This table includes the application information for each loan application at Home Credit in one row. The features are the same as the train data but exclude the target variable.
There are 121 variables and 48,744 data entries.

Secondary Tables

Bureau
This table includes all previous credits received by a customer from other financial institutions prior to their loan application. There is one row for each previous credit, meaning a many-to-one relationship with the primary table. We could join it with the primary table by using the current application ID, SK_ID_CURR.
There are 17 variables and 1,716,428 data entries.

Bureau Balance
This table includes the monthly balance for a previous credit at other financial institutions. There is one row for each monthly balance, meaning a many-to-one relationship with the Bureau table. We could join it with the Bureau table by using the bureau's ID, SK_ID_BUREAU.
There are 3 variables and 27,299,925 data entries.

Previous Application
This table includes previous applications for loans made by the customer at Home Credit. There is one row for each previous application, meaning a many-to-one relationship with the primary table. We could join it with the primary table by using the current application ID, SK_ID_CURR. There are four types of contracts:
a. Consumer loan (POS – Credit limit given to buy consumer goods)
b. Cash loan (Client is given cash)
c. Revolving loan (Credit)
d. XNA (Contract type without values)
There are 37 variables and 1,670,214 data entries.

POS CASH Balance
This table includes a monthly balance snapshot of a previous point of sale or cash loan that the customer has at Home Credit. There is one row for each monthly balance, meaning a many-to-one relationship with the Previous Application table. We would join it with the Previous Application table by using the previous application ID, SK_ID_PREV, then join it with the primary table by using the current application ID, SK_ID_CURR.
There are 8 variables and 10,001,358 data entries.

Credit Card Balance
This table includes a monthly balance snapshot of previous credit cards the customer has with Home Credit. There is one row for each previous monthly balance, meaning a many-to-one relationship with the Previous Application table. We could join it with the Previous Application table by using the previous application ID, SK_ID_PREV, then join it with the primary table by using the current application ID, SK_ID_CURR.
There are 23 variables and 3,840,312 data entries.

Installments Payments
This table includes previous repayments made or not made by the customer on credits issued by Home Credit. There is one row for each payment or missed payment, meaning a many-to-one relationship with the Previous Application table. We would join it with the Previous Application table by using the previous application ID, SK_ID_PREV, then join it with the primary table by using the current application ID, SK_ID_CURR.
There are 8 variables and 13,605,401 data entries.
The following data preprocessing tasks need to be achieved to prepare the datasets after downloading and unzipping the main application and secondary datasets:
For the Exploratory Data Analysis component of this phase, we did a precursor analysis on the data to ensure that our results would be accurate.
We looked at summary statistics for each table in the model. We primarily focused on the data distribution, identifying statistics such as the count, mean, standard deviation, minimum, IQR, and maximum.
We also looked at specific numerical and categorical features and visualized them. We created a heatmap to identify the correlation between each feature and the target variable. We also visualized the age, occupation, and distribution of credit amounts.
Please see the Exploratory Data Analysis section for our complete EDA.
In our feature engineering process, we created two types of features to enhance our dataset. First, we created new aggregate features based on aggregate functions to capture the minimum, maximum, and mean of numerical attributes across the primary and secondary datasets that were highly correlated with the target variable.
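To make the aggregate features concrete, here is a minimal sketch that collapses a many-to-one secondary table into one row per `SK_ID_CURR` using min, max, mean, and a max-minus-min range. The helper `aggregate_by_key`, the output feature names, and the toy rows are hypothetical; in the notebook the same idea would more naturally be expressed with pandas `groupby(...).agg(...)`.

```python
from collections import defaultdict

def aggregate_by_key(rows, key, column):
    """Group rows by `key` and summarize `column` with min, max, mean, and range."""
    grouped = defaultdict(list)
    for row in rows:
        value = row.get(column)
        if value is not None:          # skip missing values
            grouped[row[key]].append(value)

    features = {}
    for k, values in grouped.items():
        lo, hi = min(values), max(values)
        features[k] = {
            f"{column}_min": lo,
            f"{column}_max": hi,
            f"{column}_mean": sum(values) / len(values),
            f"{column}_range": hi - lo,   # range = max - min
        }
    return features

# Hypothetical bureau rows: two previous credits for one applicant, one for another.
bureau_rows = [
    {"SK_ID_CURR": 100001, "DAYS_CREDIT": -1000},
    {"SK_ID_CURR": 100001, "DAYS_CREDIT": -200},
    {"SK_ID_CURR": 100002, "DAYS_CREDIT": -500},
]
bureau_features = aggregate_by_key(bureau_rows, "SK_ID_CURR", "DAYS_CREDIT")
print(bureau_features[100001])
```

Because each applicant ends up with exactly one row of aggregates, the result can be joined to the primary table on `SK_ID_CURR` without duplicating applications.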

In Phase 2, we decided to engineer the following new features from the Application and Bureau datasets:
Application_Train:
Bureau: (the last 3 variables are range calculations that take the difference between the max and min aggregate values)
Similar to Phase 1, we identified the highly correlated features by creating a simple function that took a secondary dataframe name as an input variable and generated a correlation matrix between all the features in the inputted dataframe and the primary dataset's target variable.
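The correlation screening described above can be sketched as follows. This is an illustrative stand-in, not the notebook's function: the names `pearson` and `rank_by_target_correlation` and the toy columns are assumptions, and the notebook works on pandas dataframes rather than plain lists.

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences; None if either is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return None
    return cov / math.sqrt(var_x * var_y)

def rank_by_target_correlation(columns, target):
    """Return (feature, correlation) pairs sorted by absolute correlation, strongest first."""
    scored = []
    for name, values in columns.items():
        r = pearson(values, target)
        if r is not None:              # constant columns carry no signal
            scored.append((name, r))
    return sorted(scored, key=lambda item: abs(item[1]), reverse=True)

# Hypothetical toy data: three candidate features and a binary target.
target = [0, 0, 1, 1]
columns = {
    "feat_a": [1, 2, 3, 4],
    "feat_b": [4, 3, 2, 2],
    "feat_c": [5, 5, 5, 5],
}
ranked = rank_by_target_correlation(columns, target)
print(ranked)
```

Sorting by absolute value keeps strongly negative correlations alongside strongly positive ones, which matches selecting the top positive and negative features.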
All the aggregate values were calculated from the original dataframes, and a new set of dataframes (comprising the primary and secondary datasets) was generated. After the secondary datasets were merged with the primary "application_train" dataset, the new consolidated application training dataframe had a total of 240 features (including the aggregate calculations for specific features).
Further, the most highly correlated features (positive and negative) were chosen from both the primary and secondary datasets. These features were then classified into numerical and categorical variables to form inputs for 2 individual pipelines. In total, our baseline model comprised 91 features (84 numerical and 7 categorical features).
(Please see Feature Engineering section and Feature Aggregator for more details)
In Phase 1, we implemented Logistic Regression as a starting baseline model due to its easy implementation and low computational requirements. We used 5-fold cross-validation along with the hyperparameters to tune the model with the GridSearchCV function in scikit-learn.
Here is the high-level workflow for the model pipeline followed by detailed steps:

The rationale for each of the other classifier models is listed below:
We retained many of the data preprocessing procedures and data pipeline skeletal code from Phase 1. We augmented our feature engineering steps to build new features and developed a Grid Search function to tune hyperparameters and determine evaluation metrics for each classifier algorithm listed above.
Here are the experiment results for our baseline Logistic Regression model and the six classification algorithms we fine-tuned. The RMSE and MAE scores are not included in the image below but can be found in the Experiment Results section.

Furthermore, we analyzed the feature importances of the Logistic Regression, Gradient Boosting, and Decision Tree models. Though XGBoost had the best overall test performance in terms of accuracy, AUC, and precision, we could not produce a feature importance chart for it because the grid search took too long to run in our kernel. We have therefore chosen to analyze the feature importance scores from Gradient Boosting, which performed very close to XGBoost:

From the feature importance chart above, the external source scores (EXT_SOURCE_3, EXT_SOURCE_2, EXT_SOURCE_1) followed by DAYS_BIRTH and DAYS_CREDIT (from bureau dataset) are the most predictive features of the target variable.
Since HCDR is a classification task, we used the following metrics to measure model performance.
MAE
The mean absolute error is the average of the absolute values of individual prediction errors over all instances in the test set. Each prediction error is the difference between the true value and the predicted value for the instance.
$$ \text{MAE}(\mathbf{X}, h_{\mathbf{\theta}}) = \dfrac{1}{m} \sum\limits_{i=1}^{m}{| \mathbf{x}^{(i)}\cdot \mathbf{\theta} - y^{(i)}|} $$
RMSE
The root mean square error is the normalized distance between the vector of predicted values and the vector of observed values. First, the squared difference between each observed value and predicted value is calculated. RMSE is the square root of the mean of these squared differences.
$$ \text{RMSE}(\mathbf{X}, h_{\mathbf{\theta}}) = \sqrt{\dfrac{1}{m} \sum\limits_{i=1}^{m}{( \mathbf{x}^{(i)}\cdot \mathbf{\theta} - y^{(i)})^2}} $$
Accuracy Score
This metric describes the fraction of correctly classified samples. In scikit-learn, it can be modified to return solely the number of correctly classified samples. Accuracy is the default scoring method for both logistic regression and k-Nearest Neighbors in scikit-learn.
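The three metrics above can be worked through on a small example. In the experiment log, MAE and RMSE are computed from predicted probabilities while accuracy uses thresholded class labels; the sketch below follows that convention. The helper names, the 0.5 threshold, and the toy values are assumptions for illustration.

```python
import math

def mae(y_true, y_score):
    """Mean absolute difference between scores and labels."""
    return sum(abs(s - y) for y, s in zip(y_true, y_score)) / len(y_true)

def rmse(y_true, y_score):
    """Square root of the mean squared difference between scores and labels."""
    return math.sqrt(sum((s - y) ** 2 for y, s in zip(y_true, y_score)) / len(y_true))

def accuracy(y_true, y_score, threshold=0.5):
    """Fraction of rows whose thresholded score matches the label."""
    hits = sum(1 for y, s in zip(y_true, y_score) if (1 if s >= threshold else 0) == y)
    return hits / len(y_true)

# Hypothetical toy data: three repaid loans and one default.
y_true = [0, 0, 0, 1]
y_prob = [0.1, 0.2, 0.3, 0.4]

toy_mae = mae(y_true, y_prob)        # (0.1 + 0.2 + 0.3 + 0.6) / 4
toy_rmse = rmse(y_true, y_prob)      # sqrt((0.01 + 0.04 + 0.09 + 0.36) / 4)
toy_acc = accuracy(y_true, y_prob)   # the default is missed, so 3 of 4 are correct
print(toy_mae, toy_rmse, toy_acc)
```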

Precision
The precision is the ratio of true positives over the total number of predicted positives.

Recall
The recall is the ratio of true positives to the sum of true positives and false negatives. Recall assesses the ability of the classifier to find all the positive samples. The best value is 1 and the worst value is 0.

Confusion Matrix
The confusion matrix, in this case for a binary classification, is a 2x2 matrix that contains the count of the true positives, false positives, true negatives, and false negatives.
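A short sketch shows how the four confusion matrix counts yield precision, recall, and accuracy, and why a model on imbalanced data can score high accuracy with zero recall (as in the DecisionTrees row of the results table). The helper names and toy labels are hypothetical, and returning 0.0 when a denominator is zero is a convention assumed here.

```python
def confusion_counts(y_true, y_pred):
    """Return (tn, fp, fn, tp) for binary labels."""
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    return tn, fp, fn, tp

def summarize(y_true, y_pred):
    tn, fp, fn, tp = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp) if (tp + fp) else 0.0   # no predicted positives -> 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0      # no actual positives -> 0.0
    accuracy = (tp + tn) / len(y_true)
    return {"precision": precision, "recall": recall, "accuracy": accuracy}

# Hypothetical toy data: 2 defaults among 10 loans.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]

cautious = summarize(y_true, [1, 0, 0, 0, 0, 0, 0, 0, 0, 0])   # flags one default
all_negative = summarize(y_true, [0] * 10)                     # never flags a default
print(cautious, all_negative)
```

The all-negative predictor still reaches 80% accuracy on this toy set, which is why accuracy alone says little about how well defaults are detected.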

AUC (Area under ROC curve)
An ROC curve (receiver operating characteristic curve) is a graph showing the performance of a classification model at all classification thresholds. This curve plots two parameters:
▪ True Positive Rate
▪ False Positive Rate

AUC stands for "Area under the ROC Curve." That is, AUC measures the entire two-dimensional area underneath the entire ROC curve from (0,0) to (1,1).
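The area under the ROC curve equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below computes AUC directly from that pairwise definition; the function name `roc_auc` and the toy scores are assumptions, and the quadratic loop is meant for illustration rather than for datasets of this project's size.

```python
def roc_auc(y_true, y_score):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    positives = [s for y, s in zip(y_true, y_score) if y == 1]
    negatives = [s for y, s in zip(y_true, y_score) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Hypothetical toy data: one positive (0.35) is ranked below one negative (0.4).
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]
toy_auc = roc_auc(y_true, y_score)
print(toy_auc)
```

Because only the ordering of scores matters, AUC does not depend on the classification threshold, unlike accuracy, precision, and recall.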

AUC is desirable for the following two reasons:
Binary cross-entropy loss (CXE)
Binary cross-entropy loss (CXE) measures the performance of a classification model whose output is a probability value between 0 and 1. The loss increases as the predicted probability diverges from the actual label. Therefore, the objective is to minimize the binary CXE loss function.
The log loss formula for the binary case is as follows :
$$ -\frac{1}{m}\sum^m_{i=1}\left(y_i\cdot\:\log\:\left(p_i\right)\:+\:\left(1-y_i\right)\cdot\log\left(1-p_i\right)\right) $$
p-value
p-value is the probability of obtaining test results at least as extreme as the results actually observed, under the assumption that the null hypothesis is correct. A very small p-value means that such an extreme observed outcome would be very unlikely under the null hypothesis.
We will compare the classifiers with the baseline untuned model by conducting a two-tailed hypothesis test.
Null Hypothesis, H0: There is no significant difference between the two machine learning pipelines.
Alternate Hypothesis, HA: The two machine learning pipelines are different.
A p-value less than or equal to the significance level is considered statistically significant.
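The notebook runs this comparison with `stats.ttest_rel`, a paired t-test on per-fold cross-validation scores. As a rough sketch of what that test computes, the snippet below derives the paired t-statistic by hand and compares it with the two-tailed critical value for 4 degrees of freedom at a 0.05 significance level (2.776). The fold scores are hypothetical, and the helper `paired_t_statistic` is an illustration, not the SciPy implementation.

```python
import math
import statistics

def paired_t_statistic(control, experiment):
    """Return (t, degrees_of_freedom) for the paired differences control - experiment."""
    diffs = [c - e for c, e in zip(control, experiment)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    std_diff = statistics.stdev(diffs)          # sample standard deviation (n - 1)
    return mean_diff / (std_diff / math.sqrt(n)), n - 1

# Hypothetical per-fold AUC scores from 5-fold cross-validation.
baseline_scores = [0.750, 0.748, 0.752, 0.749, 0.751]
tuned_scores = [0.760, 0.757, 0.763, 0.758, 0.762]

t_stat, dof = paired_t_statistic(baseline_scores, tuned_scores)

CRITICAL_T_DF4_ALPHA_05 = 2.776                 # two-tailed, 4 degrees of freedom
reject_null = abs(t_stat) > CRITICAL_T_DF4_ALPHA_05
print(round(t_stat, 2), dof, reject_null)
```

Pairing by fold removes fold-to-fold variation from the comparison, so a small but consistent difference can still be significant.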
We started our experimentation with our Phase 1 baseline Logistic Regression model but with additional features from different datasets. Based on the results above, we received high accuracy scores as in Phase 1 at around 91.9% while our AUC values continued to stay around 75%. Our train data precision score was at 50% while recall score stood at 2%. When we evaluated our baseline model with the best hyperparameters, we did not observe significant improvement in our evaluation metrics.
When we ran the other classification algorithms, we found that XGBoost produced the best model, achieving a higher test accuracy of 92%, a test AUC of 76%, and better precision and recall scores. Gradient Boosting came very close in terms of accuracy and AUC but slightly underperformed XGBoost in precision and recall. Decision Tree and Random Forest (the latter being an ensemble method) did not achieve much improvement relative to our baseline model. In fact, the Decision Tree's difference from the baseline was not statistically significant based on its p-value (0.85).
Our worst performing model was Naive Bayes with a very low accuracy score hovering at 19.5%. We believe this has to do with the intrinsic nature of NB which operates on conditional and unconditional probabilities associated with features and not on feature weights. Another factor to consider is the presence of features that are not necessarily normally distributed.
From our Feature Importance analysis, we found that the external scores play a significant predictive role in determining risk of default. Features from the Bureau dataset and our engineered features like 'ef_ANNUAL_INCOME_PCT' enhanced our model performance.
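For our Kaggle submission, we selected XGBoost with its best parameters because its test accuracy was the highest among all algorithms, although, as described in our challenges, the predictions we were ultimately able to submit came from Gradient Boosting.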
In the Home Credit Default Risk (HCDR) project, we are using Home Credit’s data to better predict loan repayment by customers with little to no credit history. In Phase 1, we developed a baseline logistic regression algorithm.
In Phase 2, we engineered new features from the bureau datasets. We performed Grid Search on six different models: Logistic Regression, Naive Bayes, Gradient Boosting, XGBoost, Decision Trees, and Random Forest. Our best performing model was XGBoost with a test accuracy of 91.90% and a test ROC AUC score of 76.19%. All the other models had lower results, but Gradient Boosting came very close with a test ROC AUC score of 75.80%. The worst performing model was Naive Bayes. The ROC AUC score for our Phase 2 Kaggle submission was 0.74779 (from Gradient Boosting), an improvement over our Phase 1 score of 0.74306.
In Phase 3, we plan to examine our feature engineering process and determine if we can increase our Kaggle AUC score with fewer features. In order to circumvent the technical challenges, we will attempt to implement PyTorch along with SVM and other models using IU Red resources.
The challenges we faced in this phase were a continuation of those we experienced in Phase 1. We had to think hard about designing relevant features that would prove useful. As we engineered new features, we had to troubleshoot errors caused by invalid calculations such as division by zero. We had to constantly remind ourselves to perform the aggregate calculations first and then engineer new features on top of them (and not the other way around). This meant we needed to be specific about which aggregate calculations we wanted to build new features from.
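The ordering described above, aggregating first and then deriving features with a guard against division by zero, can be sketched as follows. This is a standard-library illustration rather than the notebook's actual code: the rows are made up, the column names are assumed to match the bureau table, and the feature name `ef_DEBT_TO_CREDIT` and the helper functions are hypothetical. In pandas the aggregation step would typically be a `groupby(...).agg(...)` call.

```python
from collections import defaultdict

# Made-up rows in the assumed shape of the bureau table.
bureau_rows = [
    {"SK_ID_CURR": 1, "AMT_CREDIT_SUM": 1000.0, "AMT_CREDIT_SUM_DEBT": 250.0},
    {"SK_ID_CURR": 1, "AMT_CREDIT_SUM": 3000.0, "AMT_CREDIT_SUM_DEBT": 750.0},
    {"SK_ID_CURR": 2, "AMT_CREDIT_SUM": 0.0, "AMT_CREDIT_SUM_DEBT": 0.0},
]


def aggregate_by_client(rows):
    """Step 1: collapse the per-credit rows into one record per client."""
    totals = defaultdict(lambda: {"credit_sum": 0.0, "debt_sum": 0.0})
    for row in rows:
        client = totals[row["SK_ID_CURR"]]
        client["credit_sum"] += row["AMT_CREDIT_SUM"]
        client["debt_sum"] += row["AMT_CREDIT_SUM_DEBT"]
    return dict(totals)


def safe_ratio(numerator, denominator):
    """Return None (to be treated as missing) instead of raising on a zero denominator."""
    return numerator / denominator if denominator else None


def engineer_features(aggregates):
    """Step 2: derive new features from the aggregates, never from the raw rows."""
    return {
        client_id: {
            **agg,
            "ef_DEBT_TO_CREDIT": safe_ratio(agg["debt_sum"], agg["credit_sum"]),
        }
        for client_id, agg in aggregates.items()
    }


features = engineer_features(aggregate_by_client(bureau_rows))
for client_id, row in sorted(features.items()):
    print(client_id, row)
```

Returning a missing value for a zero denominator keeps the pipeline running and leaves the decision about how to fill the gap to a later imputation step.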
Our team was not able to implement the Support Vector Machine (SVM) classifier successfully. All four of us tried and ended up crashing our Jupyter kernels (despite increasing our resources in Docker). This was a major roadblock, as we wanted to compare the performance of another non-ensemble model, such as SVM, against Logistic Regression. In addition, we could not re-implement XGBoost for the Kaggle submission, so we had to submit our results from the Gradient Boosting model.
Along the way, we faced several technical issues in developing this notebook:
Below is a screenshot of our best Kaggle submission.

We referred to the following resources to understand the algorithms and hyperparameters to modify:
Read the following: